Abstract

This paper proves, in very general settings, that convex risk minimization is a procedure to select 
a unique conditional probability model determined by the classification problem. Unlike most 
previous work, we give results that are general enough to include cases in which no minimum exists, 
as occurs typically, for instance, with standard boosting algorithms. Concretely, we first show that 
any sequence of predictors minimizing convex risk over the source distribution will converge to this 
unique model when the class of predictors is linear (but potentially of infinite dimension). Secondly, 
we show the same result holds for empirical risk minimization whenever this class of predictors is 
finite dimensional, where the essential technical contribution is a norm-free generalization bound. 
Keywords: Convex duality, classification, conditional probability estimation, maximum entropy, 
consistency, Orlicz spaces. 


1. Introduction 

The goal in (binary) classification is to learn to accurately predict the label $y \in \{-1, +1\}$ associated
with an input $x$. Unfortunately, it is NP-hard even to approximate this problem in easy cases
(Guruswami and Raghavendra, 2006); thus a computationally attractive surrogate is often utilized.
Foremost amongst these is convex risk minimization, in which a sequence of predictors is produced
that minimizes, in the limit, some convex upper bound on a predictor's classification error. In this
paper, we attempt to analyze the effectiveness of such methods in as much generality as possible.
Specifically, we aim to address the following questions:

(Q1) Suppose a sequence of predictors minimizes the convex risk over the true distribution. Does
this sequence converge to some concrete object? This question is murky because convex
functions need not have a minimum; for instance, the function $\exp(x)$ has no minimum, but rather
is minimized in the limit $x \to -\infty$. For the high-dimensional problems considered in convex
risk minimization, the minimum may also only occur "at infinity" but in a far less straightforward
way. This is typically the case, for instance, for standard boosting algorithms like
AdaBoost (Schapire and Freund, 2012). In such cases, what can be said concretely about the
convergence of a minimizing sequence?

(Q2) Now suppose a given sequence of predictors minimizes the empirical convex risk, meaning
the convex risk over some finite random draw from the distribution. What can be said about
convergence with respect to the true distribution? In other words, what can be said about
generalization and learning? The resolution is unclear here as well, since the preceding question
highlights the need for predictors to be arbitrarily large, thus dooming the boundedness on
which most standard statistical procedures rely (Boucheron et al., 2005, Section 4).

In this paper we resolve both these questions by showing that convex risk minimization converges
to a unique conditional probability model $\bar\eta$.

Main results. To state our main theorems, we first present our learning setting. We consider linear
classes of functions. That is, given a base set $\mathcal{H}$ of prediction functions $h : \mathcal{X} \to [-1, +1]$, the
corresponding linear class consists of weightings of these functions described by weight vectors $w$
with $\sum_{h \in \mathcal{H}} |w[h]| < \infty$, where $w[h]$ denotes the weight of the function $h$, and where it is understood
that these weights are non-zero only on a countable subset of $\mathcal{H}$. Formally, this class is

$$\Big\{ x \mapsto \sum_{h \in \mathcal{H}} w[h]\, h(x) \;:\; \sum_{h \in \mathcal{H}} |w[h]| < \infty \Big\}.$$

This setting recovers, for instance, the classical regression setting by choosing $\mathcal{H}$ to consist of
covariates corresponding to the dimensions of $x$, as well as the classical boosting setting by leaving
$\mathcal{H}$ arbitrary. Let $L^1(\mathcal{H})$ denote all possible choices for $w$ as above; moreover, given $w \in L^1(\mathcal{H})$,
let $Hw : \mathcal{X} \to \mathbb{R}$ denote the corresponding element of the linear class, meaning
$(Hw)(x) = \sum_{h \in \mathcal{H}} w[h]\, h(x)$. Thus, $H$ is a linear operator, abstractly collecting the elements of $\mathcal{H}$ as "columns".
The loss functions $\ell$ that we study come from a large class of certain twice continuously
differentiable losses, whose precise definition is deferred to Section 3. Both the well-studied logistic
loss $\ell_{\log}(r) := \ln(1 + \exp(r))$ and exponential loss $\ell_{\exp}(r) := \exp(r)$ belong to this class. With respect
to loss $\ell$, we define the population and empirical convex risk to be

$$\mathcal{R}(w) := \int \ell\big(-y (Hw)(x)\big)\, d\mu(x, y) \quad \text{and} \quad \mathcal{R}_n(w) := \frac{1}{n} \sum_{i=1}^{n} \ell\big(-y_i (Hw)(x_i)\big),$$

where $((x_i, y_i))_{i=1}^{n}$ is an i.i.d. draw of size $n$ from the true distribution $\mu$. Lastly, we define the
excess convex risk $\mathcal{E}(w) := \mathcal{R}(w) - \inf_{v \in L^1(\mathcal{H})} \mathcal{R}(v)$, with $\mathcal{E}_n$ defined analogously.

There are well-established methods for converting the models produced using convex risk
minimization into conditional probability models. Specifically, given a loss $\ell$ from the class above,
functions $\mathcal{H}$, and weighting $w \in L^1(\mathcal{H})$, we define

$$\phi(r) := \frac{\ell'(r)}{\ell'(r) + \ell'(-r)} \quad \text{and} \quad \eta_w(x, y) := \phi\big(y (Hw)(x)\big). \tag{1}$$

This function $\eta_w(x, y)$, which is well-defined with range $[0, 1]$ for every loss in this class, can be regarded as
an estimate of the conditional probability $\Pr[Y = y \mid x]$ (Friedman et al., 2000; Zhang, 2004; Bartlett
et al., 2006). For example, logistic loss $\ell_{\log}$ yields the usual sigmoid $\phi(r) = (1 + \exp(-r))^{-1}$.
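As a quick numerical sanity check of the link function in Eq. (1), the following minimal sketch evaluates $\phi(r) = \ell'(r) / (\ell'(r) + \ell'(-r))$ for the logistic and exponential losses. The helper names (`link`, `dlogistic`, `dexponential`) are illustrative choices made here and are not taken from the paper; the derivatives are written out by hand.

```python
import numpy as np

def link(dloss, r):
    """Link function of Eq. (1): phi(r) = l'(r) / (l'(r) + l'(-r))."""
    return dloss(r) / (dloss(r) + dloss(-r))

def dlogistic(r):
    # Derivative of the logistic loss ln(1 + exp(r)).
    return 1.0 / (1.0 + np.exp(-r))

def dexponential(r):
    # Derivative of the exponential loss exp(r).
    return np.exp(r)

r = np.linspace(-5.0, 5.0, 11)
phi_log = link(dlogistic, r)      # should coincide with the sigmoid
phi_exp = link(dexponential, r)   # should equal exp(r) / (exp(r) + exp(-r))
```

Under these assumptions, `phi_log` agrees with the sigmoid $(1 + \exp(-r))^{-1}$, and in both cases $\phi(r) + \phi(-r) = 1$, so $\eta_w(x, +1)$ and $\eta_w(x, -1)$ sum to one.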

Our convergence results do not apply to the weighting sequences $(w_i)_{i \ge 1}$ directly, since, as
mentioned earlier, these will often have no limit. Instead we prove convergence of their corresponding
conditional probability models. Specifically, our first main result, the resolution of (Q1), states that
minimizing $\mathcal{R}$ implies convergence to a unique conditional probability model $\bar\eta$.

Theorem 1.1 Let loss $\ell$ from the class of Definition 3.4, probability measure $\mu$, and hypotheses $\mathcal{H}$ be given. Then there
exists a unique conditional probability model $\bar\eta : \mathcal{X} \times \{-1, +1\} \to [0, 1]$ and a function $f_1 : \mathbb{R} \to \mathbb{R}_+$
with $f_1(\epsilon) \to 0$ as $\epsilon \downarrow 0$ such that every $w \in L^1(\mathcal{H})$ satisfies

$$\int \big|\bar\eta(x, y) - \eta_w(x, y)\big|\, d\mu(x, y) \le f_1\big(\mathcal{E}(w)\big).$$

In particular, every sequence $(w_i)_{i \ge 1}$ with $\lim_{i \to \infty} \mathcal{E}(w_i) = 0$ satisfies $\eta_{w_i} \to \bar\eta$ in $L^1(\mu)$.

Note that the existence of $\bar\eta$ is not immediate given the existence of sequences minimizing $\mathcal{R}$, since
the collection of mappings from $\mathcal{X}$ to $[0, 1]$ is not compact in the $L^1(\mu)$ metric in general. Instead,
the proof here constructs $\bar\eta$ directly via duality, and thereafter uses duality to control these sequences.

Theorem 1.1 carries two essential consequences. First, our analysis provides a convergence
concept for algorithms utilizing convex risk minimization that is more general than previous
approaches in the sense that it can handle, for instance, the unregularized boosting methods of Zhang
and Yu (2005, Algorithm 1), or even any regularized scheme with regularization weakening to zero.
Secondly, the real-valued model $Hw$ can be used for classification simply by taking its sign, which
is exactly equivalent to the sign of $\eta_w(\cdot, 1) - 1/2$, that is, the more likely label according to the
corresponding conditional probability model $\eta_w$. Therefore, convergence properties of $(\eta_{w_i})_{i \ge 1}$ imply
convergence properties of the classification errors made by $(Hw_i)_{i \ge 1}$, complementary to the results
of Bartlett et al. (2006) and Zhang (2004); see Proposition 1.3.

Next comes the resolution of (Q2): under the assumption $|\mathcal{H}| < \infty$, we show that it suffices to
minimize the empirical risk.

Theorem 1.2 Suppose the setting of Theorem 1.1, in particular the existence of $\bar\eta$, but additionally
that $|\mathcal{H}| < \infty$. There exists a nonincreasing function $f_2 : \mathbb{R} \to \mathbb{R}_+$ such that, with probability at
least 1 — 5 over an Ltd. draw of size n > / 5)) from p, every w G Li{'K) satisfies 

J\v{x,y) - Pw{x,y)\dp{x,y) = 0^/2 (^£n(w)) l^\/£n(w) + ^ 

where $\Omega(\cdot)$ and $O(\cdot)$ omit constants based on $\mathcal{H}$, $\ell$, and $\mu$, but not on the sample, or on $w$. In
particular, any sequence $(w_i)_{i \ge 1}$ with $\lim_{i \to \infty} \mathcal{E}_i(w_i) = 0$ satisfies $\eta_{w_i} \to \bar\eta$ in $L^1(\mu)$ a.s.

Note that perhaps the most natural approach to proving this theorem, namely to apply
properties of Rademacher complexity of Lipschitz functions (Boucheron et al., 2005), introduces a
dependence on the norm $\|w\|$. Instead, the bound above only exhibits a dependence on $\mathcal{E}_n(w)$,
which can be made arbitrarily small by considering only nearly optimal choices. Depending on
$\mathcal{E}_n(w)$ rather than $\|w\|$ is essential, as these minimizing sequences will generally exhibit unboundedly
growing norms, a fact often encountered in practice (see Appendix D). Note that while
Theorem 1.2 requires strictly convex losses, it is proved via generalization bounds which can handle
more than just this class, in particular the hinge loss (see Lemmas 3.10 and J.9).

Illustrative example. Suppose $\mathcal{X} = [-1, 1]^2$ and $\mathcal{H}$ consists of the coordinate functions. Consider
$\ell = \ell_{\log}$, i.e., logistic regression. Suppose that the measure $\mu$ puts all of the mass on points $x$ that
fall into two well-separated rectangular regions (depicted as red and blue in Figure 1), with points
in the blue region having $\Pr[Y = 1 \mid x] = 1$ and points in the red region having $\Pr[Y = -1 \mid x] = 1$.
From the figure, it is clear that there exist two distinct vectors, $w_1$ and $w_2$, both of which define
lines (perpendicular to them) separating positive and negative examples.

The convex risk $\mathcal{R}$ is minimized by both of the sequences $(i w_1)_{i \ge 1}$ and $(i w_2)_{i \ge 1}$; moreover,
the infimal risk is 0, which is not attained by any $w \in L^1(\mathcal{H}) = \mathbb{R}^2$, and every minimizing
sequence has norms growing unboundedly.

Conceivably, minimizing logistic loss could lead one algorithm to follow the sequence $(i w_1)_{i \ge 1}$
and another to follow $(i w_2)_{i \ge 1}$. Both of these sequences converge in the $L^1(\mu)$ metric; their
respective limit points, $\eta_{(1)}$ and $\eta_{(2)}$, are equal to 1, 1/2, and 0 (for the positive class) on those
points which have inner product, respectively, positive, 0, and negative to $w_1$ or $w_2$. Consequently,
$\eta_{(1)} \ne \eta_{(2)}$. This shows that two different runs of logistic regression could give different
probability estimates at some points. How then can Theorem 1.1 give a unique limit $\bar\eta$? The resolution
is that Theorem 1.1 gives convergence in the $L^1(\mu)$ metric. In particular, $w_1$ and $w_2$ only disagree
on the region between the two point clouds; this is a measure zero set, and thus $\eta_{(1)} = \eta_{(2)}$ $\mu$-a.e.

Figure 1: A well-separated classification problem.
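The behavior described in this example can be reproduced numerically. The sketch below is only an illustration under stated assumptions: the six data points, the two separating directions `w1` and `w2`, and the probe point `gap_point` between the two clouds are all made up here, and the measure is taken to be uniform on the six points.

```python
import numpy as np

# Hypothetical separable data: three positive and three negative points in [-1, 1]^2.
X = np.array([[0.6, 0.4], [0.8, 0.7], [0.5, 0.9],
              [-0.6, -0.5], [-0.9, -0.4], [-0.4, -0.8]])
y = np.array([1, 1, 1, -1, -1, -1])
w1 = np.array([1.0, 0.0])   # both directions separate the two clouds
w2 = np.array([0.0, 1.0])

def logistic_risk(w):
    # Mean of ln(1 + exp(-y <w, x>)), computed stably with logaddexp.
    return float(np.mean(np.logaddexp(0.0, -y * (X @ w))))

def eta_pos(w, x):
    # Estimated probability of the positive class: the sigmoid of <w, x>.
    return 1.0 / (1.0 + np.exp(-(x @ w)))

scales = [1, 10, 100]
risks_w1 = [logistic_risk(i * w1) for i in scales]   # decreases toward 0
risks_w2 = [logistic_risk(i * w2) for i in scales]   # decreases toward 0

gap_point = np.array([0.3, -0.3])           # lies between the clouds, carries no mass
eta_gap_1 = eta_pos(100 * w1, gap_point)    # close to 1
eta_gap_2 = eta_pos(100 * w2, gap_point)    # close to 0
eta_data_1 = eta_pos(100 * w1, X)           # estimates on the support of the measure
eta_data_2 = eta_pos(100 * w2, X)
```

In this sketch the risk shrinks along both scaled sequences while the norms grow, the two estimates disagree sharply at the probe point, and they agree (up to rounding) on the data points themselves, which is the sense in which the limits coincide $\mu$-a.e.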

Note that in this setting, it is also straightforward to prove an analog of the uniform deviation
bounds of Theorem 1.2; indeed, applying either VC theory (Boucheron et al., 2005) or margin
bounds (Schapire et al., 1997) will yield a bound that also lacks dependence on $\|w\|$. The
distinction, however, is what both results say when applied to a sequence which does not achieve zero
classification error. As will be shown in Proposition 1.3, the classification error of these sequences
may be erratic, and therefore only loosely describes convergence behavior. On the other hand,
Theorem 1.1 and Theorem 1.2 give a concrete object, $\bar\eta$, to which all minimizing sequences converge.

Classification errors and consistency. Let $\mathcal{R}_{\mathrm{z}}(g) := \Pr[Y \ne \mathrm{sign}(g(X))]$ denote the
classification error of any mapping $g : \mathcal{X} \to \mathbb{R}$, where $\mathrm{sign}(r) := \mathbf{1}[r > 0] - \mathbf{1}[r < 0]$. Recall that the signs
of $\eta_w(\cdot, 1) - 1/2$ and $Hw$ agree, which suggests that, because $\eta_{w_i} \to \bar\eta$ as provided by Theorems 1.1
and 1.2, there might be a relationship between $(\mathcal{R}_{\mathrm{z}}(Hw_i))_{i \ge 1}$ and $\mathcal{R}_{\mathrm{z}}(\bar\eta(\cdot, 1) - 1/2)$. However,
convergence is stymied by the points where $\bar\eta = 1/2$, that is, the points where $\mathrm{sign}(\bar\eta(\cdot, 1) - 1/2)$
is discontinuous. The following result provides that, excluding this set, the desired convergence
indeed occurs; in order to state it succinctly, further let r]^{x,y) := Pr[y = y\x] denote the true 
conditional probability model, and /rx the marginal distribution along X. 



Proposition 1.3 Suppose the setting of Theorem 1.1, and let $(w_i)_{i \ge 1}$ be any sequence with $\eta_{w_i} \to \bar\eta$
in the $L^1(\mu)$ metric, and set $A := \{(x, y) : \bar\eta(x, y) = 1/2\}$. Then


lim sup 


‘RziHwi) - fkz( ??(•, 1) - 2 


< lim sup 

Z^OO 


jj^p^{x,l) 


-11 


1 ) < - 


dpxix] 


Moreover, there exist choices of $(\mu, \mathcal{H}, \ell)$ such that * > 0 and the inequality is an equality.


The proposition implies that the difference between the classification error of $\bar\eta$ and that of $\eta_{w_i}$ is
bounded by $\mu(\bar\eta = 1/2)$ in the limit. The fact that the bound in the proposition can be tight, i.e.,
there is a gap between the classification risks even as $\eta_{w_i} \to \bar\eta$, implies that the classification risk
cannot be easily used to show convergence of $\eta_{w_i}$. Similarly, as discussed with the example in
Figure 1, any approach to the generalization analysis that bounds classification error, such as VC


Convex Risk Minimization and Conditional Probability Estimation 


theory, will be problematic since the classification error can behave erratically, as provided by the 
possibility of * > 0 in Proposition 1.3. 

Finally, recall the classical consistency results (Zhang, 2004; Bartlett et al., 2006), which may
be summarized as follows. Let MF denote the set of all measurable functions. Then there exists a
function $f_3 : \mathbb{R} \to \mathbb{R}_+$ with $f_3(\epsilon) \to 0$ as $\epsilon \downarrow 0$ so that every $w \in L^1(\mathcal{H})$ satisfies

$$\mathcal{R}_{\mathrm{z}}(Hw) - \inf_{f \in \mathrm{MF}} \mathcal{R}_{\mathrm{z}}(f) \le f_3\Big(\mathcal{R}(w) - \inf_{f \in \mathrm{MF}} \mathcal{R}(f)\Big),$$

where the last expression overloads $\mathcal{R}(f) = \int \ell(-y f(x))\, d\mu(x, y)$. As such, this result can be
seen as a combination of Theorem 1.1 and Proposition 1.3 when $\mathrm{span}(\mathcal{H})$ is a rich family of
functions (e.g., dense in MF). Consequently, the results of the present work can be seen as
complementary, providing a specific convergence target $\bar\eta$ in the case of smaller $\mathrm{span}(\mathcal{H})$ (e.g., when
$\inf_{w \in L^1(\mathcal{H})} \mathcal{R}(w) > \inf_{f \in \mathrm{MF}} \mathcal{R}(f)$), rather than a single-sided bound as above.

Outline. We close this introductory section with further notation. In Section 2, we construct $\bar\eta$
via convex duality, and sketch the proofs of Theorems 1.1 and 1.2 in Section 3. Many appendices
collect further technical discussions and proof details.

Basic notation. Symbols defined in the preceding subsections (risk $\mathcal{R}$, excess risk $\mathcal{E}$, link
function $\phi$, conditional probability model $\eta_w$) will continue to be used in future sections. The weighting
space $L^1(\mathcal{H})$ should be viewed as the $L^1$ space over the counting measure on elements of $\mathcal{H}$; since
$h \in \mathcal{H}$ always has $\sup_x |h(x)| \le 1$, it follows that $\sup_x |(Hw)(x)| \le \|w\|_1$. Furthermore, in
addition to the operator $H$, also let $A$ denote the operator for which $(Aw)(x, y) = -y (Hw)(x)$,
whereby

$$\mathcal{R}(w) = \int \ell\big((Aw)(x, y)\big)\, d\mu(x, y) = \int \ell(Aw)\, d\mu,$$

where the last form drops integration variables for succinctness.

We assume that y can be disintegrated (Chang and Pollard, 1997) into a marginal measure yx 
over X, and a conditional probability t 7 ^(x, y) := Pr[y = y\x\. Let Z := X x {—1, -1-1} be the set 
of all $(x, y)$ pairs. Given any subset $C \subseteq \mathcal{Z}$, we define the intersection measure $\mu_C(S) := \mu(C \cap S)$
and conditional measure $\mu_{|C}$, where $\mu(C) > 0$ implies $\mu_C(S) = \mu_{|C}(S)\, \mu(C)$. We use a "hat"
symbol to denote empirical measures, such as $\hat\mu$, $\hat\mu_C$, $\hat\mu_{|C}$. To avoid ambiguity, we sometimes write
$\mathcal{R}(\cdot; \nu)$ and $\mathcal{E}(\cdot; \nu)$ to denote risk and excess risk when integration is over a measure $\nu$.

Every loss $\ell : \mathbb{R} \to \mathbb{R}_+$ considered in this paper is a classification loss, meaning it is convex,
non-decreasing, and satisfies $\ell(0) > 0$ and $\lim_{z \to -\infty} \ell(z) = 0$. The class of all such losses is denoted
L. The subset of these that are strictly convex and twice continuously differentiable (i.e., £” > 0) is 
denoted The more restrictive class C 1^+ will be defined in Section 3. For classification 
losses, which are not necessarily differentiable, we write $\ell'(z)$ to denote a fixed choice from the
subgradient $\partial\ell(z)$; thus, a classification loss is described by a pair $(\ell, \ell')$ satisfying $\ell'(z) \in \partial\ell(z)$.

2. Duality: The journey to the optimal conditional probability model $\bar\eta$

This section shows the existence of the optimal conditional probability model $\bar\eta$. The key challenge
is the infinite dimensional setting, that is, the fact that the hypothesis space $\mathcal{H}$ and the sample space
$\mathcal{Z}$ are infinite. To develop some intuition, we begin by studying the finite dimensional case.

2.1. Warm-up: Finite dimensional case

Assume for now that the hypothesis set is finite, $|\mathcal{H}| = d$, and the measure $\mu$ is uniform over $n$ data
points. Consider the problem of optimizing exponential loss over this measure, i.e.,

$$\inf_{w \in \mathbb{R}^d} \sum_{i=1}^{n} \exp\big(-y_i (Hw)(x_i)\big). \tag{2}$$

The conditional model for the exponential loss is

$$\eta_w(x, y) = \frac{e^{y (Hw)(x)}}{e^{y (Hw)(x)} + e^{-y (Hw)(x)}}. \tag{3}$$

Recalling the example from Figure 1, note how easily the infimum to Eq. (2) may fail to be attained.
In particular, if there exists $w \in \mathbb{R}^d$ defining a hyperplane which strictly separates the positive and
negative examples, then the sequence $(jw)_{j \ge 1}$ achieves zero risk in the limit, whereas every element
$w \in \mathbb{R}^d$ achieves a positive risk. On the other hand, $\eta_{jw}(x_i, y_i) \to 1$ as $j \to \infty$. So in this case, $\bar\eta$,
which needs to be defined only over the examples $(x_i, y_i)$, is described by $\bar\eta(x_i, y_i) = 1$.

Similar to other studies of risk minimization stymied by the problem of missing minimizers
(Collins et al., 2002), we consider the convex dual to Eq. (2). The dual of loss minimization of
a linear model is the problem of maximizing entropy subject to constraints, where different losses
yield different kinds of entropy (Collins et al., 2002; Altun and Smola, 2006). The dual of Eq. (2) is

$$\max_{q \in \mathbb{R}^n_+} \sum_{i=1}^{n} \big(-q_i \ln q_i + q_i\big) \quad \text{s.t.} \quad \sum_{i=1}^{n} q_i \big(y_i h(x_i)\big) = 0 \ \text{ for all } h \in \mathcal{H}. \tag{4}$$

The objective on the left is an unnormalized entropy of the dual variable vector $q$, representing
an unnormalized reweighting of examples. The unnormalized entropy is being maximized
over the set of reweightings which satisfy the "decorrelation" constraints on the right. Specifically,
the constraints require that the reweighting $q$ be uncorrelated with every hypothesis, making the
reweighted prediction problem as hard as possible. Note that $q = 0$ is always feasible, but the
unnormalized entropy pushes the solution away from zero whenever feasible (the slope of the entropy
at zero is infinite; see Lemma E.1.v). Theorem 2.1 shows that the dual maximum is always attained,
unlike the primal minimum. However, if both the primal minimum $w$ and dual maximum $q$ are
attained, then $q_i = \exp(-y_i (Hw)(x_i))$. For a general differentiable loss $\ell$, the optimality
conditions yield $q_i = \ell'(-y_i (Hw)(x_i))$. If there is any example $j$ such that $x_j = x_i$, but the label is
flipped ($y_j = -y_i$), then we can rewrite $q_j$ as $q_j = \exp(y_i (Hw)(x_i))$ for exponential loss, and
$q_j = \ell'(y_i (Hw)(x_i))$ for a general loss. Let $-i$ denote such an index $j$ if it exists. Contrasting the
definition of $\eta_w$ in Eq. (3) with the optimality condition for $q$ suggests defining

$$\bar\eta(x_i, y_i) := \begin{cases} q_{-i} / (q_{-i} + q_i) & \text{if } q_i > 0, \\ 1 & \text{if } q_i = 0, \end{cases}$$

where in the absence of the example with the flipped label, define $q_{-i} := \ell'(-(\ell')^{-1}(q_i))$ to emulate
such an example; for exponential loss, $q_{-i} = 1/q_i$. The value of $\bar\eta$ for $q_i = 0$ is obtained by taking
the limit $q_i \to 0$ (i.e., $q_{-i} \to \infty$ for exponential loss). The next section shows that this $\bar\eta$ is the
correct limit object, even for an infinite sample space and an infinite hypothesis set.
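The finite-dimensional primal-dual relationship above can be checked numerically on a toy problem where the primal minimum is attained. The following is a minimal sketch under stated assumptions: the three margin vectors in `M` are made-up data (row $i$ stands for $y_i x_i$ with coordinate hypotheses), the primal is taken as the unnormalized sum of exponential losses, and Newton's method is just one convenient solver, not the paper's procedure.

```python
import numpy as np

# Rows are hypothetical margin vectors y_i * x_i in R^2. The origin lies strictly
# inside their convex hull, so the data is not separable and the minimum is attained.
M = np.array([[1.0, 0.5], [-1.0, 0.5], [0.2, -1.0]])

def primal(w):
    # Sum of exponential losses exp(-y_i (Hw)(x_i)).
    return float(np.sum(np.exp(-M @ w)))

def dual(q):
    # Unnormalized entropy sum(-q_i ln q_i + q_i), for q > 0.
    return float(np.sum(-q * np.log(q) + q))

w = np.zeros(2)
for _ in range(50):                      # Newton's method on the smooth convex primal
    q = np.exp(-M @ w)
    grad = -M.T @ q                      # gradient of the primal at w
    hess = M.T @ (q[:, None] * M)        # Hessian of the primal at w
    w = w - np.linalg.solve(hess, grad)

q = np.exp(-M @ w)                       # optimality condition q_i = exp(-y_i (Hw)(x_i))
constraint = M.T @ q                     # decorrelation: sum_i q_i y_i h(x_i) for each h
eta_bar = (1.0 / q) / (1.0 / q + q)      # q_{-i} / (q_{-i} + q_i) with q_{-i} = 1 / q_i
eta_w = np.exp(M @ w) / (np.exp(M @ w) + np.exp(-M @ w))
```

In this sketch the decorrelation constraint holds at the primal minimizer, the primal and dual values coincide, and the model built from $q$ matches $\eta_w$ on the examples.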
2.2. Infinite dimensional case


Before constructing $\bar\eta$ and proving Theorem 1.1, we establish an infinite dimensional duality result
similar to the finite dimensional result from Section 2.1. In the primal, we now minimize an integral
rather than a sum. In the dual, we optimize over unnormalized densities over $\mathcal{Z}$. Recall that the
linear map $A$ returns functions on $\mathcal{Z}$ such that $(Aw)(x, y) = -y (Hw)(x)$. Formally, we seek the
following duality result:

$$\inf_{w \in L^1(\mathcal{H})} \int \ell(Aw)\, d\mu = \max_{q \in Q} \Big[ -\int \ell^*\big(q(x, y)\big)\, d\mu(x, y) \Big] \quad \text{s.t.} \quad \int q(x, y) \big(y h(x)\big)\, d\mu(x, y) = 0 \ \text{ for all } h \in \mathcal{H}, \tag{5}$$

where $\ell^*(s) := \sup_r [rs - \ell(r)]$ is the conjugate of $\ell$ (see Appendix A). For example, when $\ell$ denotes
the exponential loss, we find that $\ell^*(s) = s \ln s - s$ for $s \ge 0$ (with the convention $0 \ln 0 = 0$) and
$\ell^*(s) = \infty$ for $s < 0$, giving rise to the non-negativity constraint on $q$ and the dual objective we
already saw in Eq. (4).

A crucial technical question is the choice of $Q$, i.e., the set that $q$ is selected from. Following
the intuition of Section 2.1, the goal is to construct $\bar\eta(x, y) = q(x, -y) / (q(x, -y) + q(x, y))$. The
space $Q$ should be large enough to allow construction of any conditional probability distribution $\eta$
for $\mu_X$. To achieve this, it suffices to make sure that all measures which are absolutely continuous
with respect to $\mu$ have their densities included in $Q$. In fact, our set can be slightly smaller: it just
needs to include all densities for which the dual objective, i.e., the integral $\int \ell^*(q)\, d\mu$, is finite.

One candidate class of functional spaces is $L_p(\mu)$, where $p \ge 1$. These are Banach spaces of
measurable functions with the norm defined by $\|f\|_p = \big(\int |f|^p\, d\mu\big)^{1/p}$. The space $L_p(\mu)$ contains
all measurable functions with $\|f\|_p < \infty$. However, in our setting, we instead want to place
restrictions on the allowed functions $q$ based on the integral $\int \ell^*(q)\, d\mu$ rather than $\int |q|^p\, d\mu$. Therefore,
instead of working with $L_p(\mu)$ spaces, we work with their generalization called large Orlicz spaces
(Leonard, 2007, and Appendix B), which allows us to tailor the set $Q$ to $\ell^*$.

In detail, the construction of a large Orlicz space begins with a non-negative convex function
$\theta : \mathbb{R} \to [0, \infty]$ symmetric around zero (i.e., $\theta(r) = \theta(|r|)$), not identical to zero (i.e., $\theta(r) \to \infty$ as
$r \to \infty$, by convexity), and with $\theta(0) = 0$. This function $\theta$ serves the same role as the $p$-th power
function in the construction of $L_p(\mu)$. The conditions that we place on $\theta$ make it possible to define
"the unit ball" of functions, analogous to the unit ball in $L_p(\mu)$, namely

$$\mathcal{B} := \Big\{ f \text{ measurable} : \int \theta\big(f(z)\big)\, d\mu(z) \le 1 \Big\}.$$

This set is then used to define the norm $\|f\|_\theta = \inf\{r > 0 : f \in r\mathcal{B}\}$, where the norm equals $\infty$
if $f$ is outside the scaled ball $r\mathcal{B}$ for all $r > 0$. The large Orlicz space $L_\theta(\mu)$ is defined to contain
all measurable functions with $\|f\|_\theta < \infty$. For $p \ge 1$, the choice $\theta(s) = |s|^p$ recovers the $L_p(\mu)$
spaces. (See Appendix B for further background.)
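To make the norm $\|f\|_\theta = \inf\{r > 0 : f \in r\mathcal{B}\}$ concrete, the sketch below computes it by bisection for a measure supported on finitely many points. Everything here is an illustrative assumption: the function name `orlicz_norm`, the discrete measure given by the array `mass`, and the restriction to a finite-valued $\theta$ with $\theta(r) \to \infty$.

```python
import numpy as np

def orlicz_norm(f, mass, theta, iters=200):
    """Bisection for inf{r > 0 : sum_i mass_i * theta(f_i / r) <= 1} on a discrete measure."""
    f = np.asarray(f, dtype=float)
    mass = np.asarray(mass, dtype=float)
    if not np.any(mass * np.abs(f) > 0):
        return 0.0                       # f vanishes almost everywhere
    def modular(r):
        return float(np.sum(mass * theta(f / r)))
    hi = 1.0
    while modular(hi) > 1.0:             # grow until f lies inside the scaled ball
        hi *= 2.0
    lo = hi / 2.0
    while modular(lo) <= 1.0:            # shrink until f lies outside the scaled ball
        lo /= 2.0
    for _ in range(iters):               # the modular is nonincreasing in r
        mid = 0.5 * (lo + hi)
        if modular(mid) <= 1.0:
            hi = mid
        else:
            lo = mid
    return hi

p = 3.0
f = np.array([1.0, -2.0, 0.5])
mass = np.array([0.2, 0.3, 0.5])
norm_theta = orlicz_norm(f, mass, lambda s: np.abs(s) ** p)
norm_p = float(np.sum(mass * np.abs(f) ** p) ** (1.0 / p))
```

With $\theta(s) = |s|^p$, the value `norm_theta` should agree with the usual $L_p$ norm `norm_p`, matching the remark that this choice recovers the $L_p(\mu)$ spaces.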

Now we are ready to answer what the space $Q$ should be. Following the construction of
Leonard (2008), we begin by introducing a symmetrized version of the loss $\ell$ with the first-order
Taylor expansion at zero removed:

$$\beta(s) := \max\Big\{ \ell(s) - \big(\ell(0) + s\, \ell'(0)\big),\ \ell(-s) - \big(\ell(0) + (-s)\, \ell'(0)\big) \Big\}. \tag{6}$$
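As a small worked check of Eq. (6), the sketch below evaluates $\beta$ for the exponential loss, where $\ell(0) = \ell'(0) = 1$ and a direct calculation gives $\beta(s) = e^{|s|} - 1 - |s|$. The helper name `beta` and its argument layout are choices made here for illustration only.

```python
import numpy as np

def beta(loss, dloss0, s):
    """Eq. (6): symmetrized loss with its first-order Taylor expansion at zero removed."""
    loss0 = loss(0.0)
    right = loss(s) - (loss0 + s * dloss0)       # remainder of the expansion at +s
    left = loss(-s) - (loss0 + (-s) * dloss0)    # remainder of the expansion at -s
    return np.maximum(right, left)

s = np.linspace(-3.0, 3.0, 13)
beta_exp = beta(np.exp, 1.0, s)                  # exponential loss: l(0) = l'(0) = 1
closed_form = np.exp(np.abs(s)) - 1.0 - np.abs(s)
```

The resulting function is non-negative, symmetric around zero, and vanishes at zero, which are the properties required of $\theta$ in the Orlicz construction.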

It turns out that the Orlicz space $L_{\beta^*}(\mu)$, derived from the conjugate $\beta^*$, satisfies our desideratum
on $Q$: it contains all the densities with respect to $\mu$ whose dual objective is finite (see Lemma G.1.iii).

The next theorem spells out the duality result of Eq. (5) with a more succinct representation of
constraints via the adjoint of the operator $A$. The adjoint is a generalization of the matrix transpose.
The adjoint $A^\top$ is a linear operator which maps $q$ into a linear function on $L^1(\mathcal{H})$ defined by
$(A^\top q)(w) = \int (Aw)(z)\, q(z)\, d\mu$. The constraint of Eq. (5) is equivalent to requiring $(A^\top q)(w) = 0$
for all $w$, i.e., $A^\top q$ is required to be the zero of the vector space of linear functions on $L^1(\mathcal{H})$. Thus,
the constraint can be written as $A^\top q = 0$, highlighting the fact that it is a linear constraint on $q$.

Apart from the duality result, the theorem also enumerates several important properties of the
dual optimum, which are relevant for the construction of $\bar\eta$ in Definition 2.2 below. Properties (i)
and (ii) show that $\bar\eta$ is a well-defined conditional probability. Property (iii) implies that $\bar\eta(x, y) =
\eta_w(x, y) = \phi(y (Hw)(x))$ when the primal optimum exists and the loss is differentiable. Property
(iv) looks more technical: it implies that when the primal optimum $w$ does not exist, $h(x) :=
(\ell')^{-1}(q(x, -1))$ can serve a similar role as $Hw$, because $\bar\eta(x, y) = \phi(y h(x))$; indeed, we use this
construction of $h$ in Section 3.1.


Theorem 2.1 Let finite measure $\mu$ over $\mathcal{Z}$, hypotheses $\mathcal{H}$, and loss function $\ell$ be given, with $\beta$
defined by Eq. (6). Then

$$\inf_{w \in L^1(\mathcal{H})} \int \ell(Aw)\, d\mu = \max_{q \in L_{\beta^*}(\mu),\ A^\top q = 0} \Big[ -\int \ell^*(q)\, d\mu \Big]. \tag{7}$$

A dual optimum $q$ always exists, and can be chosen to satisfy the following, $\mu$-a.e. over $(x, y)$:

(i) $q(x, y) \ge 0$.

(ii) $q(x, y) + q(x, -y) > 0$.

(iii) $q(x, y) \in \partial\ell\big((Aw)(x, y)\big)$ where $w$ is a primal optimum (if it exists).

Furthermore,

(iv) If £ G then {£')~^ {q{x,y)') = —{£')~^ {q{x, —y)), p-a.e. over all {x,y)forwhich {£')~^ 
is defined at both q{x, y) and q{x, —y). 

(v) If $\ell$ is differentiable, then $q$ is unique (up to $\mu$-null sets).


Using part (v), we obtain that the following defines a unique $\bar\eta$ (up to $\mu$-null sets):

Definition 2.2 Let $\ell$ be differentiable and $q$ be the dual optimum satisfying conditions (i) and
(ii) of Theorem 2.1. We define the optimal conditional model $\bar\eta$ as

$$\bar\eta(x, y) := \frac{q(x, -y)}{q(x, -y) + q(x, y)}. \tag{8}$$

This is the $\bar\eta$ that appears in Theorem 1.1. This theorem will be proved in the next section.


3. Convergence and generalization via easy and difficult sets

We saw in Section 2.1 that the conjugate $\ell^*$ of the exponential loss has an infinite slope at zero; the
same turns out to be true for all losses in the class studied here (Lemma E.1.v). Informally, this means that the dual
optimization avoids setting $q = 0$ unless forced to do so by the decorrelation constraint $A^\top q = 0$.
We will see that this distinction between the set of points where $q = 0$ and the set where $q > 0$
is fundamentally important to the analysis, a fact seen before in the analysis of boosting
(Mukherjee et al., 2011; Telgarsky, 2012, 2013). We call these two sets of points "easy" and "difficult"
(respectively) for reasons which we illustrate on an example.

An example. Consider the example in Figure 2, which builds on the example from Figure 1.
In addition to the two well-separated regions of positive and negative examples, we now add an
alternating sequence of positive and negative point masses along the line $\gamma$ orthogonal to the weight
vector $w_1$. Each weight vector $w \in \mathbb{R}^2$ represents a linear predictor returning the inner product
$x \mapsto w \cdot x$. The margin of a data point $(x, y)$ with respect to this predictor is $y (w \cdot x)$. The
decorrelation constraint (see Eq. (5)) requires that the weighted margin of every hypothesis (and of
every linear combination) according to the density $q$ is equal to zero. The predictor described by
$w_1$ gives a positive margin to all points in the two separated regions (the easy set) and zero margin
to those along the line $\gamma$ (the difficult set). Hence, any $q$ satisfying the decorrelation constraint
must equal zero over these two regions. On the other hand, because the point masses along $\gamma$ are
antisymmetric around zero, each of them can receive the density $q(x, y) = s$ where $s$ is a minimizer
of $\ell^*$ (it always exists by Lemma E.1.i).

In the primal, the sequence $(i w_1)_{i \ge 1}$ still minimizes the risk as follows. First, the risk in the
two regions goes to zero. Next, the risk of any weight vector $w$ over points along $\gamma$ is only a
function of the projection of $w$ onto $\gamma$. Since the masses along $\gamma$ are antisymmetric and the loss
function is convex (and increasing as the prediction is more wrong), the projection needs to be at
the origin to minimize the risk along $\gamma$. This is exactly the case for $i w_1$ by orthogonality.

If the example were to be slightly perturbed, so that the point masses would still lie on $\gamma$ in an
alternating pattern (but not antisymmetric), a minimizing sequence would take the form
$(v + i w_1)_{i \ge 1}$ where $v \in \gamma$ would be the minimizer of the risk of the points along $\gamma$. Because of the
alternating pattern such a minimizer would be bound to exist.

Preliminaries. Several aspects of the example carry over to the general setting. First, it can be
shown that the risk on points where $q = 0$ converges to zero when the primal is minimized, that is,
a perfect classification is achieved. Therefore, we call this set of points "easy". Second, the points
where $q > 0$ cannot be further "separated" in the sense that any $w$ under which some non-null
measure of these points receives a positive margin also yields a non-null measure of points with a
negative margin. We call this set "difficult".

Definition 3.1 Given a finite measure $\mu$, hypotheses $\mathcal{H}$, loss $\ell \in \mathbb{L}$, and a dual optimum $q$ satisfying
the conditions of Theorem 2.1, the difficult set is defined as $\mathcal{D} := \{z \in \mathcal{Z} : q(z) > 0\}$. Its
complement $\mathcal{D}^c$ is called the easy set.

The next lemma (and the following corollary) show that, similar to the example, all of the
risk is in fact due to the difficult set. The lemma proves equality of the dual objectives for $\mu$ and
the restricted measure $\mu_{\mathcal{D}}$, and furthermore that $q$ is feasible and optimal for both problems. The
corollary highlights the implications in the primal, that by optimizing the risk on $\mu$, we optimize the
risk on the difficult set, and drive the risk on the easy set to zero. For technical reasons, both results
are stated for supersets of difficult sets.

Lemma 3.2 Given a finite measure $\mu$, hypotheses $\mathcal{H}$, loss $\ell \in \mathbb{L}$, a difficult set $\mathcal{D}$ and an associated
dual optimum $q$, let $D$ be an arbitrary (measurable) superset of the difficult set: $\mathcal{D} \subseteq D$. Then the
dual optimal values for $\mu$ and $\mu_D$ are equal:

$$\max_{q \in L_{\beta^*}(\mu),\ A^\top q = 0} \Big[ -\int \ell^*(q)\, d\mu \Big] = \max_{q \in L_{\beta^*}(\mu_D),\ A^\top q = 0} \Big[ -\int \ell^*(q)\, d\mu_D \Big].$$

The general dual optimum $q$ is feasible for both problems and attains both maxima. Moreover, if $q_D$
is a dual optimum for $\mu_D$, then $q(z) := q_D(z)\, \mathbf{1}[z \in D]$ is also a dual optimum for both problems.


Corollary 3.3 Let $D$ be a superset of a difficult set, $\mathcal{D} \subseteq D$. Then $\mathcal{E}(w; \mu_D) \le \mathcal{E}(w)$ and
$\mathcal{R}(w; \mu_{D^c}) \le \mathcal{E}(w)$ for all $w \in L^1(\mathcal{H})$.

We wrap up this section by defining the class appearing in our main results. While the class 
may appear restrictive, it contains the logistic and exponential losses by Proposition E.6: 

Definition 3.4 The class C consists of strictly convex, twice continuously differentiable 
classification losses i, which in addition satisfy the following conditions: 

(i) The link function $\phi$, derived from $\ell$ as in Eq. (1), is Lipschitz-continuous with constant $L_\phi$.

(ii) For some $c_\ell > 0$, the derivative $\ell'$ satisfies $\ell'(r) \le c_\ell\, \ell(r)$ whenever $r \le 0$.

(iii) For every finite measure p over Z, there exists > 0 with ||/||/3 < J i{f)dpfor every 
measurable / : 2. —)• M+. 


3.1. Proof outline for Theorem 1.1 

Recall that our goal is to show that risk minimization yields convergence of to fj. First consider 
the easy set V^. By Corollary 3.3, minimizing i.e., taking £(ru) to zero, leads to 0l{w, p^c) 

becoming arbitrarily small. This in turn means that most predictions [Hw] (x) will not only have the 
correct sign, but will also have a large margin. This observation can be used to obtain the following 
bounds on a partition of the easy set into two sets: Sr and \ Sr. The bound on p{Sr) is also 
a bound on fg |fy — because |fy — ry^| < 1. Thus, together these bound Jjjc If — Vw\dp. 

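Lemma 3.5 Given a finite measure $\mu$, hypotheses $\mathcal{H}$, loss $\ell \in \mathbb{L}$, and a difficult set $\mathcal{D}$, let $D$ be
an arbitrary (measurable) superset of the difficult set: $\mathcal{D} \subseteq D$. Let any $w \in L^1(\mathcal{H})$ and $r > 0$ be
given, and define $S_r := \{z \in D^c : \ell((Aw)(z)) \ge r\}$. Then:

(i) $\mu(S_r) \le \mathcal{R}(w; \mu_{D^c}) / r \le \mathcal{E}(w) / r$,

(ii) $\int_{D^c \setminus S_r} |\bar\eta - \eta_w|\, d\mu \le r\, \mu(D^c \setminus S_r) \max\{1/\ell(0),\ c_\ell/\ell'(0)\}$ if $\ell$ belongs to the class of Definition 3.4.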

It remains to control $\eta_w$ over $\mathcal{D}$. As mentioned earlier, the decorrelation constraint implies that
the difficult set $\mathcal{D}$ cannot be "separated" in the sense that any $w$ under which some subset of $\mathcal{D}$ with a
positive measure $\mu$ has a positive margin (i.e., correct predictions), also yields a positive measure of
points in $\mathcal{D}$ with a negative margin (i.e., incorrect predictions). Since the loss is increasing over
negative margins, this structure implies that the risk over $\mathcal{D}$ has a minimizer over each one-dimensional
subspace (similar reasoning to the example of Figure 2). This one-dimensional property can be used
in finite dimensions to argue that the risk must have a minimizer over the difficult set, and we pursue
this line of reasoning in Section 3.2. But here, we need an alternative approach.

As discussed in Theorem 2.1.iv, if $\ell \in \mathbb{L}^{2+}$, then $\bar\eta(x,y) = \phi(-f(x,y))$ with $f(x,y) = (\ell')^{-1}(q(x,y))$ whenever $(\ell')^{-1}$ is defined for both $q(x,y)$ and $q(x,-y)$. Fortunately, this can be shown to hold $\mu$-a.e. over $D$. Thus, over $D$, we can write $|\bar\eta - \eta_w| = |\phi(-f) - \phi(-Aw)|$. The next lemma uses a second-order Taylor expansion at $f$ to further derive a bound on this difference.
In Lemma 3.6, we split the difficult set into four subsets and we either bound their mass, which in turn bounds the integral of $|\bar\eta - \eta_w|$, or directly bound the integral. The integral is controlled directly over the subset $U$ by the mentioned Taylor bound, and so it requires the bounds on the range of $Aw$ and $f$ (via $q$), and a corresponding lower bound $\tau$ on the second derivative. The subset $S_+$ contains points with a large loss, so its mass is controlled by the risk. The control of the subset $S_-$ is the most technical. The set includes points where the predictions are correct, but the density $q$ is large. The bound is based on the decorrelation constraint as well as property (iii) in Definition 3.4. All three bounds depend on $w$ only via its risk; this is indeed key to establishing Theorem 1.1. The set $V$ needs to be controlled separately.


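Lemma 3.6 Given a finite measure $\mu$ with $\mu(\mathcal{Z}) \le 1$, hypotheses $\mathcal{H}$, a loss $\ell$ in the class of Definition 3.4, a difficult set $\mathcal{D}$ and an associated dual optimum $q$, let a weighting $w \in L_1(\mathcal{H})$ be given, along with scalars $c_1 > 0$, $c_2 > 0$, $c_3 > c_2$, and $\tau := \min\{\inf_{|z| \le c_1} \ell''(z), \ldots\}$. Define the following sets:

$U := \{z \in \mathcal{D} : |(Aw)(z)| \le c_1 \text{ and } c_2 \le q(z) \le c_3\}$,
$S_+ := \{z \in \mathcal{D} : (Aw)(z) > c_1\}$,
$S_- := \{z \in \mathcal{D} : (Aw)(z) < -c_1 \text{ and } q(z) \ge c_2\}$,
$V := \{z \in \mathcal{D} : q(z) < c_2 \text{ or } q(z) > c_3\}$.

Then $\mathcal{D} = U \cup S_+ \cup S_- \cup V$, and

$\int_U |\bar\eta - \eta_w|\,d\mu \le \ldots$,

$\mu(S_+) \le \dfrac{\mathcal{R}(w)}{c_1\,\ell'(0)}$,

$\mu(S_-) \le \dfrac{2\,C_{\ell,\mu}\,\|q\|_{\beta^*}\,\mathcal{R}(w)}{c_1 c_2}$.

To prove Theorem 1.1 from here, first split $\int |\eta_w - \bar\eta|\,d\mu$ along $\mathcal{D}$ and $\mathcal{D}^c$, and apply Lemma 3.5 and Lemma 3.6 to the two pieces; the goal is to show that all terms go to zero as $\mathcal{E}(w) \to 0$. In the terms resulting from Lemma 3.5, this is handled by the choice $r := \sqrt{\mathcal{E}(w)}$. Similarly, it is possible (although considerably more challenging) to balance $c_1$, $c_2$, $c_3$, $\tau$ arising from Lemma 3.6.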

3.2. Proof outline for Theorem 1.2 

In this section we sketch the proof of the generalization bound from the introduction (Theorem 1.2). Unlike the foregoing results, here we assume that the hypothesis space is finite, $|\mathcal{H}| = d$.

Similar to Section 3.1, the proof treats the easy set and the difficult set separately. On the easy set, where zero risk is possible in the limit, linear predictors actually achieve zero classification error when viewed as half-space classifiers. Finite dimension $d$ then implies a finite VC dimension and the corresponding generalization bound. In the remainder, we only focus on the difficult set.

We build on the fact that on the difficult set $\mathcal{D}$ the risk is eventually increasing along any direction which lies in the "span" of $\mathcal{D}$ (similar to the example of Figure 2). In the finite dimension $d$, this will imply a bound on the norm of the optimizer of risk over $\mathcal{D}$, and also enable the application of Rademacher complexity to obtain a generalization bound.

We begin with a specific lower bound in each direction $w$ within the "span". The bound is obtained by integrating over all points with a negative margin, i.e., $(Aw)(z) > 0$. Because of the lack of separators over $\mathcal{D}$, the bound is non-zero. Taking an infimum over all directions yields a uniform bound called balance. While the following definition is written for any measure $\mu$, it is going to be primarily applied with $\mu_{\mathcal{D}}$ substituted for $\mu$:

Definition 3.7 The balance associated with hypotheses $\mathcal{H}$, $|\mathcal{H}| = d$, and measure $\mu$ is defined as $\mathrm{Bal}(\mu) := \inf\bigl\{\int |(Aw)(z)|_+\,d\mu(z) : w \in \mathrm{Ker}(\mu)^\perp,\ \|w\|_1 = 1\bigr\}$, where $|s|_+ := \max\{s, 0\}$ denotes the non-negative part, and $\mathrm{Ker}(\mu) := \{w \in \mathbb{R}^d : (Aw)(z) = 0,\ \mu\text{-a.e. over } z\}$ denotes the subspace of $\mathbb{R}^d$ with no effect on risk under $\mu$.
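The definition of balance can be made concrete on a toy empirical measure. The following is a minimal sketch, not the paper's procedure: it assumes $d = 2$, a measure supported on finitely many points, and a trivial kernel (so that $\mathrm{Ker}(\mu)^\perp = \mathbb{R}^2$), and it approximates the infimum by a grid over the $\ell_1$ unit sphere, which in general only gives an upper estimate. The names `rows`, `weights`, and `balance_2d` are hypothetical; each entry of `rows` stands for the vector representing $w \mapsto (Aw)(z)$ at one support point $z$.

```python
def positive_part_integral(rows, weights, w):
    """Weighted sum of max((Aw)(z), 0) over the support points of the measure."""
    total = 0.0
    for row, mass in zip(rows, weights):
        value = sum(a * c for a, c in zip(row, w))
        total += mass * max(value, 0.0)
    return total


def l1_unit_circle(num_points):
    """Yield points w in R^2 with |w_1| + |w_2| = 1, walking around the l1 diamond."""
    corners = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0), (1.0, 0.0)]
    for k in range(num_points):
        t = 4.0 * k / num_points            # parameter in [0, 4)
        edge, frac = int(t), t - int(t)     # edge index and position along the edge
        (x0, y0), (x1, y1) = corners[edge], corners[edge + 1]
        yield ((1 - frac) * x0 + frac * x1, (1 - frac) * y0 + frac * y1)


def balance_2d(rows, weights, num_points=4000):
    """Grid approximation (an upper estimate) of Bal(mu), assuming Ker(mu) = {0}."""
    return min(positive_part_integral(rows, weights, w)
               for w in l1_unit_circle(num_points))


# Toy measure: four equally weighted points, no direction separates them.
rows = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
weights = [0.25] * 4
toy_balance = balance_2d(rows, weights)
print(toy_balance)
```

For this toy measure the integral equals $(|w_1| + |w_2|)/4$ for every $w$, so every direction on the $\ell_1$ sphere gives the same value and the grid approximation is exact.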
The "span" corresponds to the orthogonal complement of the kernel $\mathrm{Ker}(\mu)$. In the example of Figure 2, the difficult set consisted of the points on the line $\gamma$, and the kernel $\mathrm{Ker}(\mu_{\mathcal{D}})$ was the subspace spanned by the vector $w_1$, which had no effect on the risk over points on $\gamma$. The only interesting directions from the perspective of this risk were in the orthogonal complement $\mathrm{Ker}(\mu_{\mathcal{D}})^\perp$.

In finite dimension $d$, we obtain that $\mathrm{Bal}(\mu_{\mathcal{D}}) > 0$ whenever $\mu(\mathcal{D}) > 0$ (Proposition J.4). This yields a non-trivial risk bound from the definition of balance, using the fact that $\ell(r) \ge \ell(0) + r\ell'(0) \ge r\ell'(0)$ (by convexity and non-negativity of $\ell$):

$$\mathcal{R}(w; \mu_{\mathcal{D}}) \ge \int_{Aw > 0} \ell(Aw)\,d\mu_{\mathcal{D}} \ge \int_{Aw > 0} \ell'(0)\,(Aw)\,d\mu_{\mathcal{D}} \ge \ell'(0)\,\|w\|_1\,\mathrm{Bal}(\mu_{\mathcal{D}}).$$

Rearranging, we also obtain a norm bound $\|w\|_1 \le \mathcal{R}(w)/(\ell'(0)\,\mathrm{Bal}(\mu_{\mathcal{D}}))$, which enables the use of Rademacher complexity in the analysis of generalization on $\mathcal{D}$.
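For the two losses discussed in the appendices, the norm bound can be instantiated directly. This is only an illustration obtained by plugging $\ell'(0)$ into the bound above, with $\ell'(0) = 1/2$ for the logistic loss and $\ell'(0) = 1$ for the exponential loss:

```latex
\begin{align*}
\text{logistic: } \ell(r) = \ln(1 + e^{r}),\ \ell'(0) = \tfrac{1}{2}
  &\implies \|w\|_1 \le \frac{2\,\mathcal{R}(w)}{\mathrm{Bal}(\mu_{\mathcal{D}})}, \\
\text{exponential: } \ell(r) = e^{r},\ \ell'(0) = 1
  &\implies \|w\|_1 \le \frac{\mathcal{R}(w)}{\mathrm{Bal}(\mu_{\mathcal{D}})}.
\end{align*}
```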

A less obvious consequence is that for a given finite hypothesis class $\mathcal{H}$ and measure $\mu$, there exists a maximal difficult set. This difficult set, common to the entire class $\mathbb{L}^{2+}$, is called the canonical difficult set $\mathcal{D}_*$ (for concreteness, we define it for $\ell = \exp$). Informally, its existence follows from the property shared by all losses $\ell \in \mathbb{L}^{2+}$ that $(\ell^*)'(s) \to -\infty$ as $s \downarrow 0$ (Lemma E.1.v); consequently, the optimization prevents $q$ from taking on the value zero unless forced by constraints, and thus yields the largest possible difficult set:

Definition 3.8 For a finite measure $\mu$ and a hypothesis set $\mathcal{H}$ with $|\mathcal{H}| < \infty$, the canonical difficult set $\mathcal{D}_*$ is defined as any difficult set associated with $\ell = \exp$.

Proposition 3.9 Given a finite measure $\mu$, a hypothesis set $\mathcal{H}$ with $|\mathcal{H}| < \infty$, and the corresponding canonical difficult set $\mathcal{D}_*$, we have:

(i) For any $\ell \in \mathbb{L}$ and any corresponding difficult set $\mathcal{D}$, we have $\mathcal{D} \subseteq \mathcal{D}_*$ $\mu$-a.e.

(ii) For any $\ell \in \mathbb{L}^{2+}$ and any corresponding difficult set $\mathcal{D}$, we have $\mathcal{D} = \mathcal{D}_*$ $\mu$-a.e.


We finish this section with the Rademacher complexity style bound on the excess risk over the canonical difficult set $\mathcal{D}_*$, based on the norm bound implied by the balance. The key insight is that the quantities in the bound depend on $w$ only through the empirical risk $\mathcal{R}(w; \hat\mu|_{\mathcal{D}_*})$. Theorem 1.2 is then proved by splitting $\int |\eta_w - \bar\eta|\,d\mu$ along $\mathcal{D}_*$ and $\mathcal{D}_*^c$, and controlling the pieces by a combination of Lemma 3.5 with the VC style bound (Lemma J.9) used to select $r$, and Lemma 3.6 with the scalars chosen via Lemma 3.10.


Lemma 3.10 Let probability measure $\mu$, hypotheses $\mathcal{H}$ with $|\mathcal{H}| = d$, loss function $\ell \in \mathbb{L}$, subgradient $s \in \partial\ell(0)$, and a canonical difficult set $\mathcal{D}_*$ with $\mu(\mathcal{D}_*) > 0$ be given. Set $\tau(r) := \inf_{|z| \le r} \ell''(z)$, $\mathrm{Bal}_* := \mathrm{Bal}(\mu|_{\mathcal{D}_*})$, and let $B_w := 2 + [\ell(0) + 2\mathcal{R}(w; \hat\mu|_{\mathcal{D}_*})]/(s\,\mathrm{Bal}_*)$ and $n \ge 256\ln(8d/\delta)/\mathrm{Bal}_*^2$. Then with probability at least $1 - 4\delta$ over a draw from $\mu$ of size $n$, the following statements hold simultaneously for every $w \in L_1(\mathcal{H})$:

(i) $|(Aw)(z)| \le B_w$ for $\mu$-a.e. and $\hat\mu$-a.e. $z \in \mathcal{D}_*$.


(ii) 

(iii) 


£(m,/i|i,J < £(m,/2|2)J -h W£{2Bu,)^Jln{8dBl/6)/n. 

or N/oc. - , , 1024£'{2B^)^ln{8dBl/6) 

£(n;,;/pj < 2£(m,//pJ +-- 


if£ G l2+. 
Appendix A. Convex analysis in Banach spaces

This appendix covers convex analysis results for functional spaces. It is based on Rockafellar (1974) 
and Rockafellar (1968). 

Banach spaces. A Banach space is a complete normed vector space. The space $\mathbb{R}^n$ with the Euclidean norm is a Banach space. Given a measure $\mu$ on $\mathcal{Z}$ and $p \ge 1$, the Banach space $L_p(\mu)$ consists of all measurable functions $f : \mathcal{Z} \to \mathbb{R}$ with the finite norm $\|f\|_p := \bigl(\int |f|^p\,d\mu\bigr)^{1/p}$.

The analog of an inner product for Banach spaces is a pairing. Given two Banach spaces $U$ and $V$, their pairing is described by a bilinear form $U \times V \to \mathbb{R}$, denoted $\langle u, v\rangle$. Thus, each $u \in U$ describes a linear map $v \mapsto \langle u, v\rangle$ on $V$ and vice versa. Each Banach space is endowed with the topology implied by its norm, but other topologies are possible. We say that the topologies on $U$ and $V$ are compatible with the pairing if the linear functions described by $u \in U$ and $v \in V$ are continuous, and if they comprise all continuous linear functions on $V$ and $U$, respectively. A Euclidean space with the norm topology is compatibly paired with itself via the standard inner product. Given $1 < p, q < \infty$ such that $1/p + 1/q = 1$, the spaces $L_p(\mu)$ and $L_q(\mu)$ with their norm topologies are a compatible pairing with the bilinear form $\langle f, g\rangle = \int fg\,d\mu$. One construction of compatible pairings begins with a Banach space $U$ under norm topology, then takes its topological dual $U'$, i.e., the space of all continuous linear functions on $U$, and endows $U'$ with the weak* topology. In the rest of the paper, when we talk about "paired Banach spaces" we assume that they have been endowed with compatible topologies.
Convexity, conjugacy, subgradients. Given a Banach space $U$, a function $F : U \to (-\infty, \infty]$ is called proper if it is not equal to $\infty$ everywhere. The set of points where $F$ is finite is called its domain and denoted $\mathrm{dom}\,F$. The epigraph of $F$ is the set of points above the graph of the function, $\{(u, t) : u \in U,\ t \in \mathbb{R},\ t \ge F(u)\}$. The function $F$ is called convex if its epigraph is convex. It is called closed if its epigraph is closed.

Let $U$ and $V$ be paired Banach spaces. Let $F : U \to (-\infty, \infty]$ be a closed proper convex function. The conjugate of $F$ is defined by $F^*(v) := \sup_{u \in U}\,[\langle u, v\rangle - F(u)]$. It is also a closed proper convex function and $F^{**} = F$ (Theorem 5 of Rockafellar, 1974). From the definition of a conjugate, we get Fenchel's inequality

$$F(u) + F^*(v) \ge \langle u, v\rangle.$$

The subgradient of $F$ at $u$ is the set $\partial F(u) := \{v : F(u') \ge F(u) + \langle u' - u, v\rangle \text{ for all } u' \in U\}$. For a closed proper convex function $F$, the following statements are equivalent (Corollary 12A and the foregoing discussion of Rockafellar, 1974) (first-order optimality for conjugates):

(i) $F(u) + F^*(v) = \langle u, v\rangle$,

(ii) $v \in \partial F(u)$,

(iii) $u \in \partial F^*(v)$.
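Fenchel's inequality and its equality case can be checked numerically in the scalar setting. The sketch below is an illustration only: it takes $F = \exp$ on $\mathbb{R}$, whose conjugate is $F^*(v) = v\ln v - v$ for $v > 0$, $F^*(0) = 0$, and $F^*(v) = \infty$ for $v < 0$, and verifies on a grid that the gap $F(u) + F^*(v) - uv$ is non-negative and vanishes when $v = F'(u) = e^u$. The function names and the grids are choices made for this example.

```python
import math


def f(u):
    """The convex function F(u) = exp(u)."""
    return math.exp(u)


def f_conj(v):
    """Conjugate of exp: v*ln(v) - v for v > 0, 0 at v = 0, +inf for v < 0."""
    if v < 0:
        return math.inf
    if v == 0:
        return 0.0
    return v * math.log(v) - v


def fenchel_gap(u, v):
    """F(u) + F*(v) - u*v, which Fenchel's inequality says is non-negative."""
    return f(u) + f_conj(v) - u * v


grid_u = [i / 4 for i in range(-12, 13)]   # u in [-3, 3]
grid_v = [j / 4 for j in range(0, 41)]     # v in [0, 10]

# Smallest gap over the grid: should not be negative.
min_gap = min(fenchel_gap(u, v) for u in grid_u for v in grid_v)

# Equality case: v = F'(u) = exp(u) is the subgradient at u.
tight_gaps = [fenchel_gap(u, math.exp(u)) for u in grid_u]
```

The equality case corresponds to the equivalence of statements (i) and (ii) above: the gap is zero exactly when $v$ is a subgradient of $F$ at $u$.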

Integrals as convex functionals. Consider a finite measure $\mu$ on $\mathcal{Z}$, and assume we are given a pairing of Banach spaces $U$ and $V$ via the bilinear form $\langle u, v\rangle = \int u(z)v(z)\,d\mu(z)$, i.e., $U$ and $V$ are subsets of measurable functions on $\mathcal{Z}$. Let $f : \mathbb{R} \to (-\infty, \infty]$ be a closed proper convex function. We study properties of the function $F$ on $U$ defined by the integral

$$F(u) = \int f(u(z))\,d\mu(z).$$

To establish its closedness and study conjugacy we need the following definition, adapted from Rockafellar (1968) for the case of a finite measure $\mu$:

Definition A.1 We say that a Banach space $U$ of measurable functions on $\mathcal{Z}$ is decomposable with respect to a finite measure $\mu$ if the following conditions hold:

(i) $U$ contains every bounded measurable function from $\mathcal{Z}$ to $\mathbb{R}$.

(ii) If $u \in U$ and $E$ is a measurable set, then $U$ contains $u \cdot \mathbb{1}_E$, where $\mathbb{1}_E$ is the indicator of the set $E$.

The following proposition is a rephrasing of the corollary on page 534 of Rockafellar (1968): 

Proposition A.2 If $\mu$ is finite and $U$ and $V$ are decomposable then $F(u)$ is a closed proper convex function, and its conjugate is

$$F^*(v) = \int f^*(v(z))\,d\mu(z).$$

The next proposition presents two additional results relating the properties of $F$ and $f$:

Proposition A.3 If $\mu$ is finite and $U$ and $V$ are decomposable then

(i) $v \in \partial F(u)$ if and only if $v(z) \in \partial f(u(z))$, $\mu$-a.e. over $z$.

(ii) If $f$ is strictly convex, then so is $F$.

Proof To show part (i), use first-order optimality for conjugates to obtain that $v \in \partial F(u)$ if and only if

$$F(u) + F^*(v) = \langle u, v\rangle. \qquad (9)$$

Since Fenchel's inequality holds pointwise, i.e., $f(u(z)) + f^*(v(z)) \ge u(z)v(z)$, Eq. (9) is equivalent to

$$f(u(z)) + f^*(v(z)) = u(z)v(z), \quad \mu\text{-a.e. over } z,$$

which, again by first-order optimality for conjugates, is equivalent to

$$v(z) \in \partial f(u(z)), \quad \mu\text{-a.e. over } z,$$

completing the proof of part (i). Part (ii) can be shown by contradiction. Assume that $F$ is not strictly convex, i.e., $F$ is flat along a line segment connecting points $u_1$ and $u_2$ which differ on a set of non-zero measure. Let $u = (u_1 + u_2)/2$. The flatness of $F$ means that $F(u) = [F(u_1) + F(u_2)]/2$, but pointwise, by convexity, $f(u(z)) \le [f(u_1(z)) + f(u_2(z))]/2$, so we must actually have $f(u(z)) = [f(u_1(z)) + f(u_2(z))]/2$, $\mu$-a.e. over $z$. Since $u_1$ and $u_2$ differ on a set of non-zero measure, we obtain that $f$ cannot be strictly convex. ■


Fenchel's duality. Given pairings $(X, Y)$ and $(U, V)$ of Banach spaces and a continuous linear map $A : X \to U$, its adjoint is a linear map $A^\top : V \to Y$ defined by $\langle x, A^\top v\rangle = \langle Ax, v\rangle$. We finish this section by stating a version of Fenchel duality used in this paper. It is a rephrasing of the duality in Example 11' and Eq. (8.26) on page 50 of Rockafellar (1974), adapted to stronger conditions (specifically, $\mathrm{dom}\,F = X$ and $\mathrm{dom}\,G = U$):

Theorem A.4 Let $(X, Y)$ and $(U, V)$ be pairings of Banach spaces. Let $F : X \to \mathbb{R}$ and $G : U \to \mathbb{R}$ be closed proper convex functions and $A : X \to U$ be a continuous linear operator. Then

$$\inf_{x \in X}\,\bigl[F(x) + G(Ax)\bigr] = \max_{v \in V}\,\bigl[-F^*(-A^\top v) - G^*(v)\bigr].$$

The point $x$ is the primal minimizer if and only if there exists a dual maximizer $v$ such that

$$-A^\top v \in \partial F(x), \qquad v \in \partial G(Ax).$$


Appendix B. Orlicz spaces 

The duality result of Section 2 is an application of Fenchel’s duality (Theorem A.4). As discussed in 
Section 2, the key challenge in applying the duality is the choice of appropriate pairings of Banach 
spaces. This appendix develops properties of specific Banach spaces, called Orlicz spaces, which 
will be sufficiently flexible to obtain pairings that satisfy our desiderata. 

Orlicz spaces generalize $L_p(\mu)$ spaces introduced in Appendix A. The construction of an Orlicz space begins with a non-negative convex function $\theta : \mathbb{R} \to [0, \infty]$ symmetric around zero, not identical to zero, and with $\theta(0) = 0$, which serves the same role as the $p$-th power function in the construction of $L_p(\mu)$. Given the function $\theta$ and a measure $\mu$, we first define the unit ball of functions

$$\mathcal{B} := \Bigl\{f \text{ measurable} : \int \theta(f(z))\,d\mu(z) \le 1\Bigr\},$$

which is then used to define the norm $\|\cdot\|_\theta$:

$$\|f\|_\theta := \inf\{r > 0 : f \in r\mathcal{B}\},$$

where the norm equals $\infty$ if $f$ is outside the scaled ball $r\mathcal{B}$ for all $r > 0$.
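The norm defined through the scaled unit ball can be computed numerically by a one-dimensional search, since $f \in r\mathcal{B}$ exactly when $\int \theta(f/r)\,d\mu \le 1$ and this integral is non-increasing in $r$ for the functions $\theta$ considered here. The sketch below is a hypothetical helper for a measure with finite support (given as values of $f$ and point masses); the bisection tolerance and the function names are choices made for this example. As a sanity check it uses $\theta(s) = |s|^p$, for which the norm coincides with the usual $L_p(\mu)$ norm.

```python
def orlicz_integral(theta, values, weights, r):
    """Integral of theta(f / r) against a finitely supported measure."""
    return sum(mass * theta(x / r) for x, mass in zip(values, weights))


def luxemburg_norm(theta, values, weights, tol=1e-12):
    """inf{r > 0 : integral of theta(f / r) <= 1}, found by bisection.

    Assumes theta is convex, symmetric around zero, with theta(0) = 0, so the
    integral is non-increasing in r, and that some finite r is feasible.
    """
    if all(x == 0 for x in values):
        return 0.0
    lo, hi = 0.0, 1.0
    while orlicz_integral(theta, values, weights, hi) > 1.0:
        hi *= 2.0                      # grow until f lies in the scaled ball
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if orlicz_integral(theta, values, weights, mid) <= 1.0:
            hi = mid                   # still inside the ball: shrink from above
        else:
            lo = mid                   # outside the ball: raise the lower end
    return hi


# Sanity check with theta(s) = |s|^2, where the answer is the L_2(mu) norm.
values = [3.0, -4.0]
weights = [0.5, 0.5]
norm_theta = luxemburg_norm(lambda s: abs(s) ** 2, values, weights)
norm_l2 = (0.5 * 3.0 ** 2 + 0.5 * 4.0 ** 2) ** 0.5
```

The same routine applies to other admissible choices of $\theta$, provided the integral is finite for some $r$.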

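The large Orlicz space $L_\theta(\mu)$ and the small Orlicz space $M_\theta(\mu)$ are defined as

$$L_\theta(\mu) := \Bigl\{f \text{ measurable} : \exists r > 0,\ \int \theta(rf)\,d\mu < \infty\Bigr\},$$
$$M_\theta(\mu) := \Bigl\{f \text{ measurable} : \forall r > 0,\ \int \theta(rf)\,d\mu < \infty\Bigr\}.$$

From the definition it is clear that

$$L_\theta(\mu) = \{f : \|f\|_\theta < \infty\},$$

so for $p \ge 1$ and $\theta(s) = |s|^p$, we recover $L_p(\mu)$ spaces. The definition also implies $M_\theta(\mu) \subseteq L_\theta(\mu)$.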
The following proposition summarizes key properties of Orlicz spaces used in this paper. Parts (i- 
iv) are paraphrased from Proposition 1.4, Proposition 1.14, Proposition 1.18 and Theorem 2.2 of 
Leonard (2007): 

Proposition B.1 Let $\mu$ be a finite measure and $\theta : \mathbb{R} \to [0, \infty]$ be a closed convex function symmetric around zero, such that $\theta(0) = 0$ and neither $\theta$ nor its conjugate $\theta^*$ are identically zero. Then the following hold:

(i) $\theta^*$ is also symmetric around zero and $\theta^*(0) = 0$.

(ii) $L_\theta(\mu)$ and $M_\theta(\mu)$ are Banach spaces with the norm $\|\cdot\|_\theta$.

(iii) For all $f \in L_\theta(\mu)$ and $g \in L_{\theta^*}(\mu)$: $\int |fg|\,d\mu \le 2\,\|f\|_\theta\,\|g\|_{\theta^*}$.

(iv) If $\theta$ is real-valued, i.e., $\mathrm{dom}\,\theta = \mathbb{R}$, then the topological dual of $M_\theta$ is isomorphic to $L_{\theta^*}$.

(v) If $\theta$ is real-valued, i.e., $\mathrm{dom}\,\theta = \mathbb{R}$, then $M_\theta$ and $L_{\theta^*}$ are decomposable.

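Proof of (v) Let $f$ be a bounded measurable function, say $|f| \le a$. Then $\theta(rf(z)) \le \theta(ra)$, so

$$\int \theta(rf)\,d\mu \le \theta(ra)\,\mu(\mathcal{Z}) < \infty \quad \text{for all } r > 0,$$

implying $f \in M_\theta(\mu)$. Also, since $\mathrm{dom}\,\theta^* \ne \{0\}$ and $\theta^*(0) = 0$, there must be some $\epsilon > 0$ such that $[-\epsilon, \epsilon] \subseteq \mathrm{dom}\,\theta^*$, and

$$\int \theta^*\bigl((\epsilon/a) f\bigr)\,d\mu < \infty,$$

implying $f \in L_{\theta^*}(\mu)$. To argue that condition (ii) of Definition A.1 holds, note that if $f \in M_\theta(\mu)$ then any $g$ with $|g| \le |f|$ must also be in $M_\theta(\mu)$, and similarly for $L_{\theta^*}(\mu)$. ■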
Appendix C. Rademacher complexity

This section collects various results from the literature on Rademacher complexity. To start, given a set of vectors $V \subseteq \mathbb{R}^n$ and letting $\sigma \in \{-1,+1\}^n$ denote a vector of $n$ independent Rademacher random variables (i.e., $\Pr[\sigma_i = +1] = \Pr[\sigma_i = -1] = 1/2$ for all $i$), define the Rademacher complexity $\mathfrak{R}$ of $V$ as

$$\mathfrak{R}(V) := \mathbb{E}\Bigl(\sup_{v \in V} \frac{1}{n}\sum_{i=1}^n v_i\sigma_i\Bigr).$$

To define the Rademacher complexity of a function $f$ or function class $\mathcal{F}$ applied to a sample $S := (z_i)_{i=1}^n$, define $f \circ S := (f(z_i))_{i=1}^n \in \mathbb{R}^n$, and similarly overload $\mathcal{F} \circ S \subseteq \mathbb{R}^n$, finally defining $\mathfrak{R}(\mathcal{F}) := \mathfrak{R}(\mathcal{F} \circ S)$. Note that these definitions match the presentation of local Rademacher complexity (Bartlett et al., 2005), whereas the original definition included an absolute value around the innermost summation (Bartlett and Mendelson, 2002; Boucheron et al., 2005).

The essential link between Rademacher complexity and deviation bounds is as follows. 

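Lemma C.1 (Shalev-Shwartz and Ben-David, 2014, Theorem 26.5) Let loss $\ell$ and function class $\mathcal{F}$ be given. Then with probability at least $1 - \delta$ over a draw $S$ of size $n$ from $\mu$,

$$\sup_{f \in \mathcal{F}} \Bigl(\int \ell(f)\,d\mu - \int \ell(f)\,d\hat\mu_n\Bigr) \le 2\,\mathfrak{R}(\ell \circ \mathcal{F} \circ S) + 4 \sup_{z \in S,\, f \in \mathcal{F}} |\ell(f(z))| \sqrt{\frac{2\ln(4/\delta)}{n}}$$
$$\le 4\max\Bigl\{1, \sup_{z \in S,\, f \in \mathcal{F}} |\ell(f(z))|\Bigr\}\Bigl(\frac{\mathfrak{R}(\ell \circ \mathcal{F} \circ S)}{2} + \sqrt{\frac{2\ln(4/\delta)}{n}}\Bigr).$$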

Thanks to Lemma C.1, the task of controlling deviations has been reduced to the task of approximating $\mathfrak{R}$. The following bounds are used throughout.


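Lemma C.2 (See also Shalev-Shwartz and Ben-David, 2014, Chapter 26) Let a collection of vectors $V \subseteq \mathbb{R}^n$ and a sample $S := (z_i)_{i=1}^n$ be given.

(i) For any scalar $c \in \mathbb{R}$ and any $v_0 \in \mathbb{R}^n$, $\mathfrak{R}(cV + v_0) \le |c|\,\mathfrak{R}(V)$.

(ii) For sets $(V_j)_{j \ge 1}$ with $V_j \subseteq \mathbb{R}^n$ and $0 \in V_j$ for all $j \ge 1$, it follows that $\mathfrak{R}(\cup_{j \ge 1} V_j) \le \sum_{j \ge 1} \mathfrak{R}(V_j)$.

(iii) For $z_i \in \mathbb{R}^d$ and a set of linear predictors $\mathcal{W} := \{z \mapsto w \cdot z : w \in \mathbb{R}^d,\ \|w\|_1 \le B\}$, it follows that $\mathfrak{R}(\mathcal{W}) = \mathfrak{R}(\mathcal{W} \circ S) \le B \sup_{z \in S} \|z\|_\infty \sqrt{2\ln(2d)/n}$.

(iv) For any $L$-Lipschitz function $\ell : \mathbb{R} \to \mathbb{R}$, it follows that $\mathfrak{R}(\ell \circ V) \le L\,\mathfrak{R}(V)$.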


Note that the aforementioned alternate form of $\mathfrak{R}$ using an absolute value breaks (i), whereas it strengthens (ii) by allowing the condition $0 \in V_j$ to be dropped.
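The bound for $\ell_1$-bounded linear predictors in part (iii) can be compared against a Monte Carlo estimate. The sketch below is illustrative only: the synthetic sample, the number of Rademacher draws, and the function names are assumptions made for this example. It uses the duality between the $\ell_1$ and $\ell_\infty$ norms, which gives $\sup_{\|w\|_1 \le B} \frac{1}{n}\sum_i \sigma_i\, w \cdot z_i = \frac{B}{n}\,\|\sum_i \sigma_i z_i\|_\infty$, so the supremum is available in closed form for each draw of $\sigma$.

```python
import math
import random


def l1_ball_rademacher_estimate(sample, radius, num_draws, rng):
    """Monte Carlo estimate of the Rademacher complexity of {z -> w.z : ||w||_1 <= radius}."""
    n, d = len(sample), len(sample[0])
    total = 0.0
    for _ in range(num_draws):
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        # Coordinate-wise sums of sigma_i * z_i; the sup over the l1 ball is
        # radius times the largest absolute coordinate, divided by n.
        col_sums = [sum(s * z[j] for s, z in zip(sigma, sample)) for j in range(d)]
        total += radius * max(abs(c) for c in col_sums) / n
    return total / num_draws


def l1_ball_rademacher_bound(sample, radius):
    """Right-hand side of part (iii): B * sup_z ||z||_inf * sqrt(2 ln(2d) / n)."""
    n, d = len(sample), len(sample[0])
    max_inf_norm = max(max(abs(x) for x in z) for z in sample)
    return radius * max_inf_norm * math.sqrt(2.0 * math.log(2 * d) / n)


rng = random.Random(0)
sample = [[rng.uniform(-1.0, 1.0) for _ in range(10)] for _ in range(200)]
estimate = l1_ball_rademacher_estimate(sample, radius=1.0, num_draws=500, rng=rng)
bound = l1_ball_rademacher_bound(sample, radius=1.0)
print(estimate, bound)
```

The estimate is a sample average, so it fluctuates around the true expectation; the comparison with the bound is meant as a plausibility check rather than a proof.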

Proof Proofs of parts (i), (iii), and (iv) can be found in (Shalev-Shwartz and Ben-David, 2014, Lemma 26.6, Lemma 26.11, Lemma 26.9); consequently, it only remains to handle (ii). For convenience, define $V_\infty := \cup_{j \ge 1} V_j$. Given any fixed $\sigma \in \{-1,+1\}^n$, the assumption $0 \in V_j$ implies

$$\sup_{v \in V_j} v \cdot \sigma \ \ge\ 0 \cdot \sigma \ =\ 0.$$
Table 1: Description of Datasets

Dataset         n (#examples)   s (average sparsity)   d (dimension)
20 news                 18845                   93.9          101631
activity               165632                   18.5              20
adult                   48842                   12.0             105
bio                    145750                   73.4              74
census                 299284                   32.0             401
covtype                581011                   11.9              54
eeg                     14980                   14.0              14
ijcnn1                  24995                   13.0              22
kdda                  8407751                   36.3        19306083
kddcup2009              50000                   58.4           71652
letter                  20000                   15.6              16
magic04                 19020                   10.0              10
maptaskcoref           158546                   40.5            5944
mushroom                 8124                   22.0             117
nomao                   34465                   82.3             174
poker                  946799                   10.0              10
rcv1                   781265                   75.7           43001
shuttle                 43500                    7.0               9
skin                   245057                    2.9               3
vehv2binary            299254                   48.6             105
w8a                     49749                   11.7             300

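Consequently, by Tonelli's theorem,

$$\mathfrak{R}(V_\infty) = \mathbb{E}\Bigl(\sup_{v \in V_\infty} \frac{1}{n}\, v \cdot \sigma\Bigr) = \mathbb{E}\Bigl(\sup_{j \ge 1}\,\sup_{v \in V_j} \frac{1}{n}\, v \cdot \sigma\Bigr) \le \mathbb{E}\Bigl(\sum_{j \ge 1}\,\sup_{v \in V_j} \frac{1}{n}\, v \cdot \sigma\Bigr) = \sum_{j \ge 1} \mathfrak{R}(V_j).$$ ■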

Appendix D. Experiments 

In this appendix we demonstrate that the best performance on a wide variety of data sets can be obtained with little or no regularization. While there is some discussion of some methods' ability to seemingly avoid overfitting (Schapire et al., 1997; Friedman, 2000), this observation is primarily folklore, which served as a motivation for our experiments, depicted in Figure 3. They were conducted as follows:
[Figure 3: grid of per-dataset plots. Recovered panel titles include maptaskcoref, shuttle, ijcnn1, kddcup2009, nomao, skin, vehv2binary, letter, poker, w8a; horizontal-axis ticks at 0.5, 1, 2.]

Figure 3: Proportion of classification errors on various testing sets of linear classifiers trained by applying L-BFGS to regularized logistic regression (ERM with logistic loss); test error is on the vertical axis, and exponent $p$ of regularization coefficient $1/n^p$ is along the horizontal axis. For more detail, please see Appendix D.







Figure 4: Companion plot to Figure 3; vertical axis is once again the proportion of classification errors, but the horizontal axis is now the quantity $\|w\|_2 \max_i \|x_i\|_2$, meaning the norm of the vector output by L-BFGS, scaled by the data norm. This quantity is relevant since it appears in the standard Rademacher bounds for linear functions (see Appendix C and Shalev-Shwartz and Ben-David, 2014, Chapter 26). For more detail, please see Appendix D.
1. We collected twenty datasets from a variety of sources (UCI, KDD Cup, libsvm data repository, and a few others), as described in Table 1.

2. Each dataset was split into 5 different (training, testing) pairs of size (80%, 20%).

3. $\ell$ was chosen to be the logistic loss $\ln(1 + \exp(\cdot))$ and $\mathcal{H}$ consisted of the coordinates, yielding the setting of logistic regression.

4. We minimized the regularized empirical risk, i.e., $\hat{\mathcal{R}}_n(w) + \lambda\|w\|_2^2/2$, where $\lambda$ was given the form $1/n^p$, where $p$ ranged over $\{1/2, 1, 2, \infty\}$, with $1/n^\infty = 0$.

5. L-BFGS was applied to this regularized variant of $\hat{\mathcal{R}}_n$ for each training/test split and each setting of the regularization parameter. Each point in Figure 3 is the median across the five splits of the data. Standard L-BFGS code was used (via scikit-learn), with very relaxed termination criteria in order to avoid early stopping (pgtol = 10“®, factr = 100). In order to provide evidence that early stopping was avoided, please see Figure 4, which roughly captures the norms of the selected predictors.
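The protocol above can be mimicked at toy scale. The following is a simplified sketch, not the code used for the paper's experiments: it uses a small synthetic dataset instead of the datasets of Table 1, a single 80/20 split instead of five, and plain gradient descent substituted for L-BFGS so that the example depends only on the Python standard library. All names (`fit`, `holdout_error`, the data-generating rule) are hypothetical.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit(xs, ys, lam, steps=2000, step_size=1.0):
    """Gradient descent on mean ln(1 + exp(-y * w.x)) + lam * ||w||_2^2 / 2."""
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [lam * wj for wj in w]
        for x, y in zip(xs, ys):
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            coeff = -y * sigmoid(-margin) / n   # derivative of the logistic loss term
            for j in range(d):
                grad[j] += coeff * x[j]
        w = [wj - step_size * gj for wj, gj in zip(w, grad)]
    return w


def holdout_error(w, xs, ys):
    """Fraction of held-out points with non-positive margin."""
    wrong = sum(1 for x, y in zip(xs, ys)
                if y * sum(a * b for a, b in zip(w, x)) <= 0)
    return wrong / len(xs)


# Synthetic data: linear labels with 10% label noise (an assumption of this sketch).
rng = random.Random(0)
xs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(40)]
ys = []
for x in xs:
    y = 1 if x[0] + 0.5 * x[1] >= 0 else -1
    if rng.random() < 0.1:
        y = -y
    ys.append(y)
train_x, train_y, test_x, test_y = xs[:32], ys[:32], xs[32:], ys[32:]

n = len(train_x)
results = {}
for p in (0.5, 1.0, 2.0, math.inf):
    lam = 0.0 if math.isinf(p) else n ** (-p)   # lambda = 1 / n^p, with 1 / n^inf = 0
    w = fit(train_x, train_y, lam)
    norm = math.sqrt(sum(c * c for c in w))
    results[p] = (holdout_error(w, test_x, test_y), norm)
    print(p, results[p])
```

Each entry of `results` pairs the held-out classification error with the norm of the fitted vector, which are the quantities plotted on the vertical axes of Figures 3 and 4 and the horizontal axis of Figure 4 (up to the data-norm scaling).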

Note that even as the norm of $w$ increases, the classification error converges, and in most cases it is in fact minimized at large norms. It is essential that the plots depict classification error, whereby Theorem 1.2 and Proposition 1.3 explain why they behave stably. By contrast, if the goal were to recover specific iterates or control the loss itself, there are lower bounds indicating a dependence on norms is necessary (Levy et al., 2014).

Appendix E. Properties of classification losses 
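E.1. Basics

Lemma E.1

(i) If $\ell \in \mathbb{L}$, then $\lim_{z \to -\infty} \ell(z) = 0$, $\ell^*(0) = 0$, $\ell^*(s) = \infty$ whenever $s < 0$, and $s \in \partial\ell(0)$ satisfies $\ell^*(s) = \min_{s' \in \mathbb{R}} \ell^*(s') = -\ell(0) < 0$.

(ii) If $\ell \in \mathbb{L}^{2+}$, then $\ell' > 0$ and $\lim_{z \to -\infty} \ell'(z) = 0$.

(iii) If $\ell \in \mathbb{L}^{2+}$, then $\liminf_{z \to -\infty} \ell''(z) = 0$.

(iv) If $\ell \in \mathbb{L}^{2+}$ and $\ell$ is Lipschitz, then $\liminf_{z \to \infty} \ell''(z) = 0$.

(v) If $\ell \in \mathbb{L}^{2+}$, then $\lim_{s \downarrow 0} (\ell^*)'(s) = -\infty$. Additionally, if $\lim_{r \to \infty} \ell'(r) =: L < \infty$, then $\lim_{s \uparrow L} (\ell^*)'(s) = \infty$.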

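Proof

(i) The first property follows from $\inf_{z \in \mathbb{R}} \ell(z) = 0$, which for convex $\ell$ with $\lim_{z \to -\infty} \ell(z) > 0$ implies $\ell$ is not nondecreasing.

The second property follows from $\ell^*(0) = \sup_{r \in \mathbb{R}} (0 \cdot r - \ell(r)) = -\inf_{r \in \mathbb{R}} \ell(r) = 0$.

Next, since $\lim_{z \to -\infty} \ell(z) = 0$, $s < 0$ implies

$$\ell^*(s) = \sup_{r \in \mathbb{R}} (rs - \ell(r)) \ge \lim_{r \to -\infty} (rs - \ell(r)) = \infty.$$

Lastly, because $\ell$ is closed, $s \in \partial\ell(0)$ implies $0 \in \partial\ell^*(s)$, which is the first order optimality condition, giving $\ell^*(s) = \min_{s' \in \mathbb{R}} \ell^*(s')$. Moreover, by Fenchel's inequality (which holds with equality since $s \in \partial\ell(0)$), $\ell^*(s) + \ell(0) = 0 \cdot s = 0$, meaning $\ell^*(s) = -\ell(0)$, and lastly $\ell(0) > 0$, because $\ell \in \mathbb{L}$.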


TELGARSKY DUDÍK SCHAPIRE


(ii) If there existed $z'$ with $\ell'(z') = 0$, then $\ell'' > 0$ implies $\ell'(z' - 1) < 0$, contradicting the fact that $\ell$ is nondecreasing.

Next, the Mean Value Theorem grants for every $z < 0$ a $q_z \in [2z, z]$ such that

$$0 = \lim_{z \to -\infty} \ell(z) = \lim_{z \to -\infty} \bigl(\ell(2z) + \ell'(q_z)(z - 2z)\bigr) = \lim_{z \to -\infty} (-z)\,\ell'(q_z),$$

which necessitates $\lim_{z \to -\infty} \ell'(z) = 0$ since $\ell'$ is nondecreasing and $\ell' > 0$.

(iii) Similarly to the above derivation for first derivatives, the Mean Value Theorem grants for every $z < 0$ a $q_z \in [2z, z]$ such that

$$0 = \lim_{z \to -\infty} \ell'(z) = \lim_{z \to -\infty} \bigl(\ell'(2z) + \ell''(q_z)(z - 2z)\bigr) = \lim_{z \to -\infty} (-z)\,\ell''(q_z),$$

which necessitates $\liminf_{z \to -\infty} \ell''(z) = 0$ by positivity of $\ell''$.

(iv) Since $\lim_{z \to -\infty} \ell'(z) = 0$ (as above) and $\ell'' > 0$ and $\ell$ is Lipschitz, there exists $L > 0$ with $\lim_{z \to \infty} \ell'(z) = L < \infty$. Similarly to the proof of the preceding property, Taylor's theorem grants for every $z > 0$ a $q_z \in [z, 2z]$ with

$$L = \lim_{z \to \infty} \ell'(z) = \lim_{z \to \infty} \bigl(\ell'(2z) + \ell''(q_z)(z - 2z)\bigr) = L + \lim_{z \to \infty} (-z)\,\ell''(q_z),$$

which again necessitates $\liminf_{z \to \infty} \ell''(z) = 0$ by positivity of $\ell''$.

(v) By strict convexity of $\ell$, $\ell^*$ is differentiable over the interior of its domain (Hiriart-Urruty and Lemaréchal, 2001, Theorem E.4.1.1). By part (i), $\mathrm{dom}\,\ell^*$ includes $0$ and some $s > 0$, so we can write $\lim_{s \downarrow 0} (\ell^*)'(s) = \lim_{s \downarrow 0} (\ell')^{-1}(s) = -\infty$, where the last step follows because $\ell'$ is strictly increasing and $\lim_{r \to -\infty} \ell'(r) = 0$.

Given $\lim_{r \to \infty} \ell'(r) = L < \infty$, we obtain as before $\lim_{s \uparrow L} (\ell^*)'(s) = \lim_{s \uparrow L} (\ell')^{-1}(s) = \infty$.


Proposition E.2 Let $\ell \in \mathbb{L}^{2+}$ be given.

(i) The link $\phi$ is a monotone increasing bijection between $\mathbb{R}$ and $(0,1)$, and moreover continuously differentiable.

(ii) If $\phi$ is convex over $(-\infty, 0]$ and concave over $[0, \infty)$, then $L_\phi = \ell''(0)/(2\ell'(0))$. (This holds in particular for the logistic and exponential losses, which therefore have $L_\phi$, respectively, equal to 1/4 and 1/2.)
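The stated Lipschitz constants can be checked numerically. The sketch below assumes that the link has the form $\phi(z) = \ell'(z)/(\ell'(z) + \ell'(-z))$; Eq. (1) is not restated in this appendix, so this form is an assumption, chosen because it is the expression whose limits at $\pm\infty$ are 1 and 0 as in part (i). The grid range, grid size, and function names are choices made for the example. Finite-difference slopes never exceed the Lipschitz constant, so the largest slope on a fine grid should sit just below $\ell''(0)/(2\ell'(0))$.

```python
import math


def link(loss_grad, z):
    """Assumed link: phi(z) = l'(z) / (l'(z) + l'(-z))."""
    return loss_grad(z) / (loss_grad(z) + loss_grad(-z))


def logistic_grad(z):
    """Derivative of ln(1 + exp(z))."""
    return 1.0 / (1.0 + math.exp(-z))


def exp_grad(z):
    """Derivative of exp(z)."""
    return math.exp(z)


def max_slope(loss_grad, lo=-5.0, hi=5.0, num=2001):
    """Largest finite-difference slope of the link on a uniform grid over [lo, hi]."""
    h = (hi - lo) / (num - 1)
    pts = [lo + i * h for i in range(num)]
    vals = [link(loss_grad, z) for z in pts]
    return max((b - a) / h for a, b in zip(vals, vals[1:]))


slope_logistic = max_slope(logistic_grad)   # compare with L_phi = 1/4
slope_exp = max_slope(exp_grad)             # compare with L_phi = 1/2
print(slope_logistic, slope_exp)
```

Under the assumed form, the logistic loss gives the usual sigmoid as its link, while the exponential loss gives a sigmoid with doubled argument, which is why its constant is twice as large.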


Proof

(i) Note that

$$\phi'(z) = \frac{\ell''(z)\ell'(-z) + \ell'(z)\ell''(-z)}{(\ell'(z) + \ell'(-z))^2}, \qquad (10)$$

which is positive and continuous, because $\ell \in \mathbb{L}^{2+}$, so $\phi$ is increasing. Note that $\lim_{r \to \infty} \phi(r) = 1$ and $\lim_{r \to -\infty} \phi(r) = 0$, because $\ell'$ is increasing and $\lim_{r \to -\infty} \ell'(r) = 0$. The bijection statement follows by continuity.

(ii) By assumption, $\phi'$ is largest at 0. By the form of $\phi'$ given in Eq. (10) above, it follows that $L_\phi = \ell''(0)/(2\ell'(0))$. The convexity/concavity property may be manually checked for the exponential and logistic losses, since they respectively give $\phi''$ to be

$$-\frac{4e^{2x}(e^{2x} - 1)}{(1 + e^{2x})^3} \qquad \text{and} \qquad -\frac{e^{x}(e^{x} - 1)}{(1 + e^{x})^3}.$$ ■


E.2. Elements of 

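Lemma E.3 Let finite non-null measure $\mu$ over $\mathcal{Z}$ and function $f : \mathcal{Z} \to \mathbb{R}$ be given with $f \ge 0$ $\mu$-a.e. and $\int \exp(f)\,d\mu < \infty$. Set $b := \int \exp(f)\,d\mu/\mu(\mathcal{Z})$. Then $\int \exp(f/b)\,d\mu \le \mu(\mathcal{Z})\,e^{1/e}$.

Proof Note that $b \ge \int \exp(0)\,d\mu/\mu(\mathcal{Z}) = 1$. Consequently, the function $r \mapsto r^{1/b}$ is concave, and thus Jensen's inequality (applied to the normalized measure $\mu/\mu(\mathcal{Z})$) grants

$$\int \exp(f/b)\,d\mu = \mu(\mathcal{Z}) \int \exp(f)^{1/b}\,d\mu/\mu(\mathcal{Z}) \le \mu(\mathcal{Z}) \Bigl(\int \exp(f)\,d\mu/\mu(\mathcal{Z})\Bigr)^{1/b} = \mu(\mathcal{Z})\,b^{1/b}.$$

Next it will be shown that the function $g(z) := z^{1/z}$ is maximized over $(0, \infty)$ at $z := e$, which gives the result. To this end, note

$$g'(z) = z^{1/z}\,z^{-2}\,(1 - \ln(z)),$$

which is positive for $z \in (0, e)$ and negative for $z \in (e, \infty)$. ■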

Lemma E.4 Let finite measure $\mu$ over $\mathcal{Z}$ and function $f : \mathcal{Z} \to \mathbb{R}$ be given with $f \ge 0$ $\mu$-a.e. and $\mu(\mathcal{Z}) \le 2$. If $\ell \in \mathbb{L}$ denotes the exponential loss $\ell = \exp$, then $\|f\|_\beta \le \int \exp(f)\,d\mu/\mu(\mathcal{Z})$.

Proof If $\int \exp(f)\,d\mu = \infty$, there is nothing to show, thus suppose $\int \exp(f)\,d\mu < \infty$. Since $\ell''(z) = \exp(z) \ge 1$ if $z \ge 0$ and $\le 1$ if $z \le 0$, the Taylor expansion yields, for $z \ge 0$,

$$\exp(z) \ge 1 + z + z^2/2 \quad \text{and} \quad \exp(-z) \le 1 - z + z^2/2.$$

Consequently, for any $z \ge 0$,

$$\exp(z) - (1 + z) \ge z^2/2 \ge \exp(-z) - (1 - z),$$

which means $\beta(z) = \exp(z) - (1 + z)$ when $z \ge 0$. Combining this with Lemma E.3, setting $b := \int \exp(f)\,d\mu/\mu(\mathcal{Z})$ for convenience,

$$\int \beta(f/b)\,d\mu = \int \bigl(\exp(f/b) - 1 - f/b\bigr)\,d\mu \le \mu(\mathcal{Z})\,(e^{1/e} - 1) \le 1.$$

By the definition of $\|\cdot\|_\beta$, it follows that $\|f\|_\beta \le b$. ■
Lemma E.5 Let finite measure $\mu$ over $\mathcal{Z}$, and function $f : \mathcal{Z} \to \mathbb{R}$ be given with $f \ge 0$ $\mu$-a.e. If $\ell \in \mathbb{L}$ is $L$-Lipschitz, then $\|f\|_\beta \le \frac{L}{\ell'(0)} \int \ell(f)\,d\mu$.

Proof To start, for any $r \ge 0$, since $\ell$ is nondecreasing,

$$\beta(r) = \max\{\ell(r) - (\ell(0) + r\ell'(0)),\ \ell(-r) - (\ell(0) - r\ell'(0))\}$$
$$\le \max\{\ell(0) + rL - (\ell(0) + r\ell'(0)),\ \ell(0) - (\ell(0) - r\ell'(0))\}$$
$$\le r\max\{L - \ell'(0),\ \ell'(0)\}.$$

Setting $b := L \int \ell(f)\,d\mu/\ell'(0)$,

$$\int \beta(f/b)\,d\mu \le \max\{L - \ell'(0),\ \ell'(0)\}\,\frac{\ell'(0) \int f\,d\mu}{L \int \ell(f)\,d\mu}$$
$$\le \max\{L - \ell'(0),\ \ell'(0)\}\,\frac{\ell'(0) \int f\,d\mu}{L\,\bigl(\ell(0)\mu(\mathcal{Z}) + \ell'(0) \int f\,d\mu\bigr)}$$
$$\le \max\{L - \ell'(0),\ \ell'(0)\}\,\frac{\ell'(0) \int f\,d\mu}{L\,\ell'(0) \int f\,d\mu}$$
$$\le 1,$$

where the last step follows because $\ell'(0) \le L$. By the definition of $\|\cdot\|_\beta$, it follows that $\|f\|_\beta \le b$. ■


Proposition E.6 Let finite measure $\mu$ over $\mathcal{Z}$ with $\mu(\mathcal{Z}) \le 2$ and hypotheses $\mathcal{H}$ be given. Then $\ell \in \mathbb{L}$ having a finite Lipschitz constant $L$ entails $C_{\ell,\mu} \le L/\ell'(0)$, and $\ell = \exp$ entails $C_{\ell,\mu} \le 1/\mu(\mathcal{Z})$. Secondly, $\ell = \ln(1 + \exp(\cdot))$ entails $L_\phi = 1/4$, and $\ell = \exp$ entails $L_\phi = 1/2$. Thirdly, $\ell = \ln(1 + \exp(\cdot))$ entails $c_\ell = 2$, and $\ell = \exp$ entails $c_\ell = 1$. In particular, in either case, the loss satisfies the three conditions of Definition 3.4.

Proof Everything but the bounds on $c_\ell$ have already been provided by Lemma E.5, Lemma E.4, and Proposition E.2. For $c_\ell$, the bound is immediate for $\ell = \exp$ (since then $\ell = \ell'$), thus consider $\ell = \ln(1 + \exp(\cdot))$. Noting the second-order Taylor expansion of $\ln$ along $[1, 1 + q]$ with $q \le 1$ is $\ln(1 + q) \ge \ln(1) + q\ln'(1) + q^2 \inf_{s \in [1,2]} \ln''(s)/2$, then $r \le 0$ implies

$$\ell(r) = \ln(1 + e^r) \ge e^r - \frac{e^{2r}}{2} \sup_{s \in [1,2]} \frac{1}{s^2} = e^r - \frac{e^{2r}}{2} \ge \frac{e^r}{2} \ge \frac{e^r}{2(1 + e^r)} = \frac{\ell'(r)}{2}.$$ ■

Appendix F. Proof of Proposition 1.3 

The proof of Proposition 1.3 is split into two lemmas; first, an upper bound establishing the general inequality, and second, an example showing the right-hand side of the inequality can be positive and tight. The proof of this upper bound is a straightforward consequence of standard manipulations for classification error (Devroye et al., 1996, Theorem 2.1).
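Lemma F.1 Let probability measure $\mu$, hypotheses $\mathcal{H}$, and loss $\ell \in \mathbb{L}^{2+}$ be given. For any $w \in L_1(\mathcal{H})$,

$$\bigl|\mathcal{R}_z(Hw) - \mathcal{R}_z(\bar\eta(\cdot, 1) - 1/2)\bigr| \le \int_{\bar\eta = 1/2} (\eta_\mu(x, 1) - \eta_\mu(x, -1))\,(1 - \mathbb{1}[\eta_w(x, 1) \ge 1/2])\,d\mu_X(x) + \star,$$

where $\star \to 0$ as $\int |\bar\eta - \eta_w|\,d\mu \to 0$.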

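Proof Following the derivation of Devroye et al. (1996, Theorem 2.1), for any $g : \mathcal{X} \to \{-1, +1\}$ and any $x \in \mathcal{X}$,

$$\Pr[g(X) \ne Y \mid X = x] = 1 - \Pr[g(X) = Y \mid X = x] = 1 - \bigl(\mathbb{1}[g(x) = 1]\,\eta_\mu(x, 1) + \mathbb{1}[g(x) = -1]\,\eta_\mu(x, -1)\bigr).$$

Consequently, for any $g_1 : \mathcal{X} \to \{-1, +1\}$, $g_2 : \mathcal{X} \to \{-1, +1\}$, and any $x \in \mathcal{X}$,

$$\Pr[g_1(X) \ne Y \mid X = x] - \Pr[g_2(X) \ne Y \mid X = x]$$
$$= \eta_\mu(x, 1)\bigl(\mathbb{1}[g_2(x) = 1] - \mathbb{1}[g_1(x) = 1]\bigr) + \eta_\mu(x, -1)\bigl(\mathbb{1}[g_2(x) = -1] - \mathbb{1}[g_1(x) = -1]\bigr)$$
$$= (\eta_\mu(x, 1) - \eta_\mu(x, -1))\bigl(\mathbb{1}[g_2(x) = 1] - \mathbb{1}[g_1(x) = 1]\bigr). \qquad (11)$$

With this in mind, define $g_1(x) := \mathbb{1}[\eta_w(x, 1) \ge 1/2]$ and $g_2(x) := \mathbb{1}[\bar\eta(x, 1) \ge 1/2]$, whereby the signs of $(Hw)(x)$ and $\eta_w(x, 1) - 1/2$ agree, and

$$\mathcal{R}_z(Hw) - \mathcal{R}_z(\bar\eta(\cdot, 1) - 1/2) = \Pr[g_1(X) \ne Y] - \Pr[g_2(X) \ne Y] = \triangle + \square,$$

where

$$\triangle := \int_{\bar\eta = 1/2} \bigl(\Pr[g_1(X) \ne Y \mid X = x] - \Pr[g_2(X) \ne Y \mid X = x]\bigr)\,d\mu_X(x),$$
$$\square := \int_{\bar\eta \ne 1/2} \bigl(\Pr[g_1(X) \ne Y \mid X = x] - \Pr[g_2(X) \ne Y \mid X = x]\bigr)\,d\mu_X(x).$$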

To bound these terms, applying Eq. (11) to the first term and using $g_2(x) = 1$ along $\bar\eta = 1/2$ yields

$$\triangle = \int_{\bar\eta = 1/2} (\eta_\mu(x, 1) - \eta_\mu(x, -1))\,(1 - \mathbb{1}[g_1(x) = 1])\,d\mu_X(x).$$

For the second term, note

$$\mathbb{1}[g_1(x) \ne g_2(x)] \le \min\Bigl\{1,\ \frac{|\eta_w(x, 1) - \bar\eta(x, 1)|}{|\bar\eta(x, 1) - 1/2|}\Bigr\}.$$

Combining this with Eq. (11),

$$|\square| \le 2 \int_{\bar\eta \ne 1/2} \min\Bigl\{|\eta_\mu(x, 1) - 1/2|,\ \frac{|\eta_\mu(x, 1) - 1/2|}{|\bar\eta(x, 1) - 1/2|}\,|\eta_w(x, 1) - \bar\eta(x, 1)|\Bigr\}\,d\mu_X(x),$$

which plays the role of $\star$ in the statement. To see that $\star \to 0$ as $\|\eta_w - \bar\eta\|_1 \to 0$, first note, for any $\sigma \in (0, 1/2]$, that

$$\star \le 2 \int_{|\bar\eta - 1/2| \in (0, \sigma)} 1\,d\mu_X(x) + 2 \int_{|\bar\eta - 1/2| \ge \sigma} \frac{|\eta_\mu(x, 1) - 1/2|}{|\bar\eta(x, 1) - 1/2|}\,|\eta_w(x, 1) - \bar\eta(x, 1)|\,d\mu_X(x)$$
$$\le 2 \int_{|\bar\eta - 1/2| \in (0, \sigma)} 1\,d\mu_X(x) + \frac{1}{\sigma} \int |\eta_w(x, 1) - \bar\eta(x, 1)|\,d\mu_X(x).$$

Since the first term goes to 0 as $\sigma \to 0$, it suffices to choose $\sigma := \sqrt{\|\eta_w - \bar\eta\|_1}$ and the result follows. ■


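In order to establish the tightness of the bound, consider any $\epsilon \in [0, 1)$, let $\mathcal{X} = [-1, 1]$, and define the following probability measure $\mu$ over $\mathcal{X} \times \{-1, +1\} = \mathcal{Z}$:

$$a := -1, \quad b := 1 - \epsilon; \qquad \mu_X(a) = \frac{1 - \epsilon}{2 - \epsilon}, \quad \mu_X(b) = \frac{1}{2 - \epsilon}; \qquad \eta_\mu(a, +1) = 1, \quad \eta_\mu(b, +1) = 1.$$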

Lemma F.2 Let scalar e G [0,1), probability measure p as above, hypotheses !K := {h} where 
h{x) = X, and loss I G he given. Then the sequence with Wi := (—l)*/i satisfies 

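$\eta_{w_i} \to \bar\eta$ and

$$\begin{aligned}
\mathcal{R}_{\mathrm z}(Hw_i) - \mathcal{R}_{\mathrm z}(\bar\eta(\cdot, 1) - 1/2)
&= \int_{\bar\eta = 1/2} \big(\eta_\mu(x, 1) - \eta_\mu(x, -1)\big)\, \mathbb{1}[\eta_{w_i}(x, 1) < 1/2]\, d\mu_{\mathcal X}(x) \\
&= \begin{cases}
\dfrac{1}{2 - \varepsilon} & \text{when } i \text{ is odd}, \\[2mm]
\dfrac{1 - \varepsilon}{2 - \varepsilon} & \text{when } i \text{ is even}.
\end{cases}
\end{aligned}$$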


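Proof Note that $\mathcal{R}$ has primal optimum $\bar w = 0$: evaluating the gradient of $\mathcal{R}$ at $\bar w$ gives

$$-a\,\ell'(-a\bar w)\,\mu_{\mathcal X}(a) - b\,\ell'(-b\bar w)\,\mu_{\mathcal X}(b) = \ell'(0)\big(\mu_{\mathcal X}(a) - (1 - \varepsilon)\,\mu_{\mathcal X}(b)\big) = 0.$$

By Theorem 2.1, $\bar q = \ell'(0)$ $\mu$-a.e., thus $\bar\eta = \phi(0) = 1/2$ $\mu$-a.e., and $\mathcal{R}_{\mathrm z}(\bar\eta(\cdot, 1) - 1/2) = 0$. Turning now to $w_i$, since $\bar\eta = 1/2$ and $\eta_\mu = \mathbb{1}[\bar\eta \ge 1/2]$ everywhere,

$$\begin{aligned}
\mathcal{R}_{\mathrm z}(Hw_i) - \mathcal{R}_{\mathrm z}(\bar\eta(\cdot, 1) - 1/2) = \mathcal{R}_{\mathrm z}(Hw_i)
&= \sum_{x \in \{a, b\}} \mu_{\mathcal X}(x)\, \mathbb{1}[\eta_{w_i}(x, 1) < 1/2] \\
&= \int_{\bar\eta = 1/2} \big(\eta_\mu(x, 1) - \eta_\mu(x, -1)\big)\, \mathbb{1}[\eta_{w_i}(x, 1) < 1/2]\, d\mu_{\mathcal X}(x).
\end{aligned}$$

Moreover, when $i$ is odd, then $\mathcal{R}_{\mathrm z}(Hw_i) = \mu_{\mathcal X}(b)$, whereas $i$ being even implies $\mathcal{R}_{\mathrm z}(Hw_i) = \mu_{\mathcal X}(a)$.

Lastly, the convergence statement follows since $w_i \to \bar w$, thus $\eta_{w_i} = \phi(Hw_i) \to \phi(H\bar w) = \bar\eta$ by continuity of $\phi$ (cf. Proposition E.2). ■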
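The oscillation in Lemma F.2 is easy to reproduce numerically. The sketch below is illustrative only and rests on stated assumptions: the two-point measure has masses $(1-\varepsilon)/(2-\varepsilon)$ at $a = -1$ and $1/(2-\varepsilon)$ at $b = 1-\varepsilon$ (the values consistent with the proof), all labels are $+1$, the predictor is $x \mapsto wx$, ties at zero are classified as $+1$, and `dloss0` stands in for a positive derivative of the loss at zero. All function names are hypothetical.

```python
def example_measure(eps):
    """Two-point marginal of the tightness example: masses at a = -1 and b = 1 - eps."""
    a, b = -1.0, 1.0 - eps
    mass = {a: (1.0 - eps) / (2.0 - eps), b: 1.0 / (2.0 - eps)}
    return a, b, mass


def risk_gradient_at_zero(eps, dloss0=0.5):
    """Derivative at w = 0 of w -> sum_x mass(x) * loss(-w * x), all labels being +1.
    dloss0 plays the role of loss'(0) and is an assumed positive constant."""
    a, b, mass = example_measure(eps)
    return -a * dloss0 * mass[a] - b * dloss0 * mass[b]


def zero_one_risk(w, eps):
    """Mass of points misclassified by x -> w * x when every label is +1
    (a point is an error exactly when w * x < 0)."""
    _, _, mass = example_measure(eps)
    return sum(m for x, m in mass.items() if w * x < 0)


def oscillation(eps, n):
    """Zero-one risks along w_i = (-1)^i / i for i = 1, ..., n."""
    return [zero_one_risk((-1) ** i / i, eps) for i in range(1, n + 1)]
```

Calling `oscillation(0.5, 4)` alternates between the two limit values of the lemma even though $w_i \to 0$, which is the point of the tightness example.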


Proof (of Proposition 1.3) The proof follows by instantiating the bound in Lemma F.1 for each $w_i$, and applying $\limsup_{i \to \infty}$ to the absolute value of both sides. On the other hand, Lemma F.2 with any $\varepsilon \in [0, 1)$ provides the instance with $\star > 0$ and both limsups being equal. Note that the existence of oscillation exhibited in Lemma F.2 does not depend on our particular definition of sign(0). ■


Appendix G. Proofs from Section 2 

To prove the main duality result (Theorem 2.1), we rely on a pairing of Orlicz spaces $M^\beta$ and $L^{\beta^*}$ implied by Proposition B.1.iv for a specific choice of $\beta$ introduced in Eq. (6). We begin by showing how the norms $\|\cdot\|_\beta$ and $\|\cdot\|_{\beta^*}$ relate to the primal and dual objectives.

Recall that $\beta$ is a symmetrized version of a loss $\ell \in \mathbb{L}$ with the first-order Taylor expansion at zero subtracted, and it thus represents the curvature of $\ell$:

$$\beta(s) := \max\big\{ \ell(s) - \big(\ell(0) + s\,\ell'(0)\big),\ \ell(-s) - \big(\ell(0) + (-s)\,\ell'(0)\big) \big\}.$$

Note that this $\beta$ satisfies the conditions on $\theta$ in Proposition B.1 and it is finite on $\mathbb{R}$, so we obtain the Banach space pairing between $M^\beta(\mu)$ with norm topology and $L^{\beta^*}(\mu)$ with weak* topology.

Lemma G.1 Given a finite measure $\mu$ over $\mathcal{Z}$ and a loss function $\ell \in \mathbb{L}$, the following hold:

(i) If $f \in M^\beta(\mu)$ then $\int \ell(f)\, d\mu < \infty$.

(ii) $\beta^*(s) \le \ell(0) + \min\{\ell^*(\ell'(0) - |s|),\ \ell^*(\ell'(0) + |s|)\}$.

(iii) Let $\nu$ be any measure absolutely continuous with respect to $\mu$, and let $f$ denote its density with respect to $\mu$, meaning $f := d\nu/d\mu$. Then $\int \ell^*(f)\, d\mu < \infty$ implies $f \in L^{\beta^*}(\mu)$.


Proof 

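(i) Since $f \in M^\beta(\mu)$ means $\int \beta(f)\, d\mu < \infty$, the definition of $\beta$ and property $\ell \ge 0$ grant

$$\begin{aligned}
\int \ell(f)\, d\mu &\le \int \big(\ell(f) + \ell(-f)\big)\, d\mu \\
&= \int \Big(\ell(f) - \big(\ell(0) + \ell'(0) f\big) + \ell(-f) - \big(\ell(0) - \ell'(0) f\big)\Big)\, d\mu + 2 \int \ell(0)\, d\mu \\
&\le 2 \int \beta(f)\, d\mu + 2\,\ell(0)\,\mu(\mathcal{Z}) \\
&< \infty.
\end{aligned}$$

(ii) For convenience, define

$$\ell_+(r) := \ell(r) - \big(\ell(0) + r\,\ell'(0)\big) \quad\text{and}\quad \ell_-(r) := \ell(-r) - \big(\ell(0) - r\,\ell'(0)\big),$$

and note (e.g., from definition of conjugate or by Theorem 12.3 of Rockafellar, 1970) that

$$\ell_+^*(s) = \ell(0) + \ell^*(\ell'(0) + s) \quad\text{and}\quad \ell_-^*(s) = \ell(0) + \ell^*(\ell'(0) - s).$$

Since $\beta = \max\{\ell_+, \ell_-\}$, then, by definition of conjugate, $\beta^* \le \min\{\ell_+^*, \ell_-^*\}$, yielding the result.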

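(iii) Let $\bar s := \ell'(0)$. By Lemma E.1.i, $\ell^*$ is minimized at $\bar s \ge 0$, so it must be non-increasing on $[0, \bar s]$ and non-decreasing on $[\bar s, \infty)$. Also, by Lemma E.1.i, $\ell^*(0) = 0$, so $\ell^* \le 0$ on $[0, \bar s]$. Part (ii) therefore implies

$$\beta^*(s) \le \begin{cases}
\ell(0) & \text{if } |s| \le \bar s, \\
\ell(0) + \ell^*(2|s|) & \text{if } |s| > \bar s.
\end{cases}$$

Let $f = d\nu/d\mu$, i.e., $f = |f|$ ($\mu$-a.e.) and assume that $\int \ell^*(f)\, d\mu < \infty$. Using the previous bound on $\beta^*$, write

$$\begin{aligned}
\int \beta^*(f/2)\, d\mu &\le \ell(0)\,\mu(\mathcal{Z}) + \int_{f/2 > \bar s} \ell^*(f)\, d\mu \\
&= \ell(0)\,\mu(\mathcal{Z}) + \int \ell^*(f)\, d\mu - \int_{f/2 \le \bar s} \ell^*(f)\, d\mu \\
&\le \ell(0)\,\mu(\mathcal{Z}) + \int \ell^*(f)\, d\mu + \ell(0)\,\mu(\{f/2 \le \bar s\}) \\
&< \infty,
\end{aligned}$$

where the next to last step follows because $\ell^*(s) \ge \ell^*(\bar s) = -\ell(0)$ by Lemma E.1.i.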
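Part (ii) of Lemma G.1 can be checked numerically for a concrete loss. The sketch below assumes the exponential loss $\ell(r) = e^r$ (so $\ell(0) = \ell'(0) = 1$ and $\ell^*(s) = s\ln s - s$ for $s > 0$); the conjugate of $\beta$ is approximated from below by a grid search, so the inequality being tested can only be helped, never faked, by the discretization. All function names are hypothetical.

```python
import math


def conj_exp(s):
    """Convex conjugate of exp: s*log(s) - s for s > 0, 0 at s = 0, +inf for s < 0."""
    if s < 0:
        return math.inf
    if s == 0:
        return 0.0
    return s * math.log(s) - s


def beta(s):
    """Symmetrized curvature of exp: max{e^s - 1 - s, e^(-s) - 1 + s}."""
    return max(math.exp(s) - 1.0 - s, math.exp(-s) - 1.0 + s)


def beta_conj_lower(t, lo=-10.0, hi=10.0, steps=20001):
    """Grid approximation of sup_s [t*s - beta(s)]; never exceeds the true conjugate."""
    best = -math.inf
    for k in range(steps):
        s = lo + (hi - lo) * k / (steps - 1)
        best = max(best, t * s - beta(s))
    return best


def upper_bound(t):
    """Right-hand side of Lemma G.1(ii) for exp, using l(0) = l'(0) = 1."""
    return 1.0 + min(conj_exp(1.0 - abs(t)), conj_exp(1.0 + abs(t)))
```

For this particular loss the bound is attained whenever $|t| > 1$, since then only the branch $\ell^*(\ell'(0) + |t|)$ is finite and it coincides with the conjugate of $\beta$.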


Proof (of Theorem 2.1) The duality law will be proved via Fenchel's duality (Theorem A.4). To begin, we need to define Banach space pairings. One of them is $(L_1(\mathcal{H}), L')$ where $L'$ is the topological dual of $L_1(\mathcal{H})$ and the other is $(M^\beta(\mu), L^{\beta^*}(\mu))$, which is a valid pairing as argued at the beginning of this appendix.

We invoke Theorem A.4 with $F : L_1(\mathcal{H}) \to \mathbb{R}$, $G : M^\beta(\mu) \to \mathbb{R}$ defined by

$$F(w) = 0 \text{ for all } w, \qquad G(f) = \int \ell(f)\, d\mu,$$

and $A : L_1(\mathcal{H}) \to M^\beta(\mu)$ defined as in Section 1. Note that $F^*(u) = I[u = 0]$ where $I$ denotes the convex indicator, yielding the constraint $A^\top q = 0$. To prove Eq. (7), it remains to show that $A$ is continuous as a map from $L_1(\mathcal{H})$ to $M^\beta(\mu)$, $G$ is finite on $M^\beta(\mu)$, and

$$G^*(q) = \int \ell^*(q)\, d\mu.$$

Finiteness of $G$ follows by Lemma G.1.i; the expression for the conjugate $G^*$ follows by Proposition A.2, because $M^\beta(\mu)$ and $L^{\beta^*}(\mu)$ are decomposable (by Proposition B.1). Finally, to argue continuity of $A$, consider $w, w' \in L_1(\mathcal{H})$. From the definition of $A$, $|(Aw)(z)| \le \|w\|_1$, so $Aw$ is a bounded measurable function and hence in $M^\beta(\mu)$ (by decomposability). Also,

$$|(A(w' - w))(z)| \le \|w' - w\|_1. \qquad (12)$$

Let $f_1(z) = 1$ for all $z$. For any $f$ and $g$ such that $|f| \le |g|$, we have $\|f\|_\beta \le \|g\|_\beta$, so Eq. (12) implies

$$\|A(w' - w)\|_\beta \le \|w' - w\|_1\, \|f_1\|_\beta,$$

showing the continuity of $A$, because $\|f_1\|_\beta$ is finite (by decomposability).

It remains to show the properties of the dual optima: 

(i) The bound follows since $\ell^*(s) = \infty$ whenever $s < 0$ by Lemma E.1.

(ii) Any dual optimum $q$ may be modified on a $\mu$-null set to obtain $\bar q$ satisfying the condition. To start, define $S := \{x \in \mathcal{X} : q(x, 1) = q(x, -1) = 0\}$; from part (i), $q \ge 0$ ($\mu$-a.e.), so it suffices to produce $\bar q$ by modifying $q$ on a $\mu$-null subset of $S$.

Recall that $\eta_\mu(x, y)$ represents the conditional probability of $y$ given $x$, i.e., $d\mu(x, y) = \eta_\mu(x, y)\, d\mu_{\mathcal X}(x)$ and $\eta_\mu(x, -1) + \eta_\mu(x, 1) = 1$. We will write $\mathcal{Y} = \{-1, 1\}$. First consider those points where $\eta_\mu(x, y) \in (0, 1)$; in particular, the set

$$S_0 := \{x \in S : \eta_\mu(x, 1) \in (0, 1)\},$$

and, for the sake of contradiction, suppose that $\mu_{\mathcal X}(S_0) > 0$. Pick $\bar s \in \partial\ell(0)$, whereby $\ell^*(\bar s) < \ell^*(0) = 0$ by Lemma E.1. Define $\hat q \in L^{\beta^*}(\mu)$ as

$$\hat q(x, y) = \begin{cases}
q(x, y) & \text{when } x \notin S_0, \\
\bar s & \text{when } x \in S_0 \text{ and } \eta_\mu(x, -y) \ge \eta_\mu(x, y), \\
\bar s \cdot \dfrac{\eta_\mu(x, -y)}{\eta_\mu(x, y)} & \text{when } x \in S_0 \text{ and } \eta_\mu(x, -y) < \eta_\mu(x, y).
\end{cases}$$

We show that $\hat q$ is dual-feasible and achieves a better objective value than $q$. By construction, $\hat q \in L^{\beta^*}(\mu)$ (since $q \in L^{\beta^*}(\mu)$, which is decomposable, and the adjustment is bounded), and moreover, for every $w \in L_1(\mathcal{H})$,

$$\begin{aligned}
(A^\top \hat q)(w) &= \int_{S_0 \times \mathcal{Y}} (Aw)\,\hat q\, d\mu + \int_{S_0^c \times \mathcal{Y}} (Aw)\,\hat q\, d\mu \\
&= \int_{S_0} (Hw)(x)\big(\hat q(x, -1)\,\eta_\mu(x, -1) - \hat q(x, 1)\,\eta_\mu(x, 1)\big)\, d\mu_{\mathcal X}(x) + \left( \int (Aw)\, q\, d\mu - \int_{S_0 \times \mathcal{Y}} (Aw)\, q\, d\mu \right) \\
&= 0 + (0 - 0),
\end{aligned}$$

where the last step follows from the definition of $\hat q$, feasibility of $q$ and the fact that $S_0 \subseteq S$. Thus, $\hat q$ is feasible. On the other hand,

$$\int \ell^*(\hat q)\, d\mu = \int \ell^*(q)\, d\mu + \int_{S_0 \times \mathcal{Y}} \ell^*(\hat q)\, d\mu$$

because $q = 0$ along $S_0 \times \mathcal{Y}$. By construction, $\hat q \in (0, \bar s]$ along $S_0 \times \mathcal{Y}$. Further, $\ell^*(s) < 0$ for $s \in (0, \bar s]$ by Lemma E.1, so $\ell^*(\hat q) < 0$ along $S_0 \times \mathcal{Y}$. Hence, $\mu_{\mathcal X}(S_0) > 0$ implies $\hat q$ attains a lower objective value than $q$, a contradiction; thus $\mu_{\mathcal X}(S_0) = 0$.

It has been shown that $q(x, y) + q(x, -y) > 0$ over $(x, y)$ with $\eta_\mu(x, y) \in (0, 1)$, $\mu$-a.e.; consequently, it suffices to consider $(x, y)$ with $\eta_\mu(x, y) \in \{0, 1\}$. Define $\bar q \in L^{\beta^*}(\mu)$ as

$$\bar q(x, y) = \begin{cases}
q(x, y) & \text{when } \eta_\mu(x, y) \in (0, 1], \\
\bar s & \text{when } \eta_\mu(x, y) = 0.
\end{cases}$$

Since the adjustment is only on points where $\eta_\mu(x, y) = 0$, then $\bar q = q$ $\mu$-a.e., and thus is also a dual solution. Furthermore, since $\mu_{\mathcal X}(S_0) = 0$, then $\mu_{\mathcal X}$-a.e. over $x \in S$, we have $\bar q(x, -1) + \bar q(x, 1) \ge \bar s > 0$ as desired.


(iii) This follows directly from Theorem A.4 and Proposition A.3.i.

(iv) Consider a sequence $(w_i)_{i \ge 1}$ minimizing the primal. By Eq. (7) and since $A^\top \bar q = 0$, this means that

$$\int \ell(Aw_i)\, d\mu + \int \ell^*(\bar q)\, d\mu - \langle A^\top \bar q, w_i \rangle \to 0 \qquad (13)$$

as $i \to \infty$. Let $r_i := Aw_i$. Since $\langle A^\top \bar q, w_i \rangle = \langle \bar q, Aw_i \rangle = \langle \bar q, r_i \rangle$, Eq. (13) can be rearranged to

$$\int \big(\ell(r_i) + \ell^*(\bar q) - \bar q\, r_i\big)\, d\mu \to 0.$$

By Fenchel's inequality, the integrand is non-negative, so we actually have

$$\ell(r_i(z)) + \ell^*(\bar q(z)) - \bar q(z)\, r_i(z) \to 0 \quad \mu\text{-a.e. over } z \in \mathcal{Z}. \qquad (14)$$

Denote the set of points $z$ where $\ell^*$ is differentiable at $\bar q(z)$ as $S$. Define $\bar f(z) := (\ell^*)'(\bar q(z))$ for $z \in S$. Over $z \in S$, we have by first-order optimality for conjugates that $\ell^*(\bar q) = \bar q \bar f - \ell(\bar f)$, and $\bar q = \ell'(\bar f)$, and thus Eq. (14) implies

$$\ell(r_i) - \ell(\bar f) - \ell'(\bar f)(r_i - \bar f) \to 0 \quad \mu\text{-a.e. over } z \in S.$$

Hence, from strict convexity of $\ell$ we obtain that $r_i \to \bar f$, $\mu$-a.e. over $z \in S$. Now, let $S_{\mathcal X} := \{x \in \mathcal{X} : (x, 1) \in S \text{ and } (x, -1) \in S\}$ be the set of points $x$ where $\ell^*$ is differentiable at both $\bar q(x, 1)$ and $\bar q(x, -1)$. From the definition of $r_i$, we have $r_i(x, 1) + r_i(x, -1) = 0$ and thus we must also have $\bar f(x, 1) + \bar f(x, -1) = 0$, $\mu_{\mathcal X}$-a.e. over $x \in S_{\mathcal X}$. Unrolling the definition of $\bar f$ yields the desired result.

(v) If $\ell$ is differentiable, then $\ell^*$ is strictly convex (Hiriart-Urruty and Lemaréchal, 2001, Theorem E.4.1.2), whereby $\int \ell^*\, d\mu$ is also strictly convex by Proposition A.3.ii, and thus the dual optimizer is unique up to $\mu$-null sets.


To close, note an additional technical property of $\bar q$ which will be useful in various proofs.
Lemma G.2 Given finite measure //, hypotheses ‘Ji, and loss i G I?'*' with L := limr_s.c!o (■'{r), it
follows that every dual optimum $\bar q$ satisfies $\mu(\{z \in \mathcal{Z} : \bar q(z) \ge L\}) = 0$.


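Proof Note that $\ell^*$ is strictly convex (by differentiability of $\ell$) and differentiable everywhere except possibly at the endpoints of its domain (by strict convexity of $\ell$). If $L = \infty$, there is nothing to show, thus suppose $L < \infty$, which entails $\mathrm{dom}(\ell^*) \subseteq [0, L]$ (since the image of the derivative map $\ell'$ is the domain of the conjugate derivative map $(\ell^*)'$, and this coincides, up to the endpoints, with $\mathrm{dom}\,\ell^*$). So it suffices to show that $\mu(\bar q = L) = 0$.

Note that $\ell'(0) \in (0, L)$ since $\ell'' > 0$. Define a scalar $N := (L + \ell'(0))/2$, set $D := \{z \in \mathcal{Z} : \bar q(z) \in (0, L]\}$, and partition $D$ into the three pieces

$$R_1 := \{z \in \mathcal{Z} : \bar q(z) \in (0, N]\}, \qquad
R_2 := \{z \in \mathcal{Z} : \bar q(z) \in (N, L)\}, \qquad
R_3 := \{z \in \mathcal{Z} : \bar q(z) = L\}.$$

We next study the integral $\int \ell^*((1 - \alpha)\bar q)\, d\mu$ for small values of $\alpha$ over these pieces.

($R_2$) Since $\ell^*$ is increasing along $[\ell'(0), L]$, then every sufficiently small $\alpha > 0$ and every $z \in R_2$ satisfies $\ell^*((1 - \alpha)\bar q(z)) \le \ell^*(\bar q(z))$, and in particular

$$\int_{R_2} \ell^*((1 - \alpha)\bar q)\, d\mu \le \int_{R_2} \ell^*(\bar q)\, d\mu.$$

($R_1$) Consider the function

$$F(\alpha) = \int_{R_1} \ell^*((1 - \alpha)\bar q)\, d\mu.$$

This is a univariate convex function which is finite on a neighborhood of 0. Pick $\tau > 0$ such that $[-\tau, \tau]$ lies in this neighborhood. Since this is a closed bounded subset of the relative interior of $\mathrm{dom}\,F$, we obtain (by Rockafellar, 1970, Theorem 10.4) that $F$ is Lipschitz-continuous on $[-\tau, \tau]$. Let $L'$ be its Lipschitz constant on $[-\tau, \tau]$. For $|\alpha| \le \tau$, we obtain

$$\int_{R_1} \ell^*((1 - \alpha)\bar q)\, d\mu \le |\alpha| L' + \int_{R_1} \ell^*(\bar q)\, d\mu.$$

($R_3$) Note $\lim_{z \uparrow L} (\ell^*)'(z) = \infty$ (by Lemma E.1), thus the definition of subgradient grants

$$\begin{aligned}
\int_{R_3} \ell^*((1 - \alpha)\bar q)\, d\mu &= \mu(R_3)\, \ell^*((1 - \alpha)L) \\
&\le \mu(R_3)\big(\ell^*(L) - (\ell^*)'((1 - \alpha)L)\,(L - (1 - \alpha)L)\big) \\
&= -\alpha L\, \mu(R_3)\, (\ell^*)'((1 - \alpha)L) + \int_{R_3} \ell^*(\bar q)\, d\mu.
\end{aligned}$$

To finish, first note $\bar q \in [0, L]$ for $\mu$-a.e. $z \in \mathcal{Z}$ (since otherwise $\int \ell^*(\bar q)\, d\mu > \int \ell^*(0)\, d\mu = 0$), and $\ell^*((1 - \alpha)\bar q) = 0$ wherever $\bar q = 0$. Combining these pieces, since $\bar q$ is optimal and $(1 - \alpha)\bar q$ is feasible for $\alpha \in [0, 1]$, then for sufficiently small $\alpha > 0$,

$$\begin{aligned}
\int \ell^*(\bar q)\, d\mu &\le \int \ell^*((1 - \alpha)\bar q)\, d\mu \\
&= \int_{R_1} \ell^*((1 - \alpha)\bar q)\, d\mu + \int_{R_2} \ell^*((1 - \alpha)\bar q)\, d\mu + \int_{R_3} \ell^*((1 - \alpha)\bar q)\, d\mu \\
&\le \alpha L' + \int_{R_1} \ell^*(\bar q)\, d\mu + \int_{R_2} \ell^*(\bar q)\, d\mu - \alpha L\, \mu(R_3)\, (\ell^*)'((1 - \alpha)L) + \int_{R_3} \ell^*(\bar q)\, d\mu \\
&= \alpha\big(L' - L\, \mu(R_3)\, (\ell^*)'((1 - \alpha)L)\big) + \int \ell^*(\bar q)\, d\mu,
\end{aligned}$$

which rearranges to give

$$L\, \mu(R_3)\, \underbrace{(\ell^*)'((1 - \alpha)L)}_{\triangle} \le L'.$$

Since $L > 0$ and $\triangle \uparrow \infty$ as $\alpha \downarrow 0$ whereas $L'$ is constant, it follows that $\mu(R_3) = 0$. ■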


Appendix H. Proof of Lemma 3.2 and Corollary 3.3 

This brief appendix section collects proofs of two results from the introductory part of Section 3. 
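Proof (of Lemma 3.2) Applying Theorem 2.1 to both $\mu$ and $\mu_{\mathcal D}$,

$$\begin{aligned}
\inf_{w \in L_1(\mathcal{H})} \mathcal{R}(w) &= \max\left\{ -\int \ell^*(q)\, d\mu \ :\ q \in L^{\beta^*}(\mu),\ A^\top q = 0 \right\}, \\
\inf_{w \in L_1(\mathcal{H})} \mathcal{R}(w; \mu_{\mathcal D}) &= \max\left\{ -\int_{\mathcal D} \ell^*(q)\, d\mu \ :\ q \in L^{\beta^*}(\mu_{\mathcal D}),\ A^\top q = 0 \right\}.
\end{aligned}$$

Of course, $\bar q$ attains the first dual maximum over $\mu$; note, as follows, that it also attains the dual maximum over $\mu_{\mathcal D}$. First, $\bar q$ is feasible for the second problem, since $\bar q \in L^{\beta^*}(\mu)$ and $\bar q = 0$ on $\mathcal{D}^c$, so we also have $\bar q \in L^{\beta^*}(\mu_{\mathcal D})$, and for every $v \in L_1(\mathcal{H})$,

$$0 = \int (Av)\, \bar q\, d\mu = \int_{\mathcal D} (Av)\, \bar q\, d\mu.$$

Furthermore, since $\ell \in \mathbb{L}$ implies $\ell^*(0) = 0$ (by Lemma E.1), it follows that

$$\int \ell^*(\bar q)\, d\mu = \int_{\mathcal D} \ell^*(\bar q)\, d\mu.$$

Consequently,

$$\max_{\substack{q \in L^{\beta^*}(\mu) \\ A^\top q = 0}} -\int \ell^*(q)\, d\mu = -\int \ell^*(\bar q)\, d\mu = -\int_{\mathcal D} \ell^*(\bar q)\, d\mu \le \max_{\substack{q \in L^{\beta^*}(\mu_{\mathcal D}) \\ A^\top q = 0}} -\int_{\mathcal D} \ell^*(q)\, d\mu. \qquad (15)$$

Now consider any dual optimum $q_{\mathcal D}$ over $\mu_{\mathcal D}$, and set $\tilde q(z) := q_{\mathcal D}(z)\, \mathbb{1}[z \in \mathcal{D}]$. Mimicking the derivations above, $\tilde q$ is feasible and optimal over $\mathcal{D}$ (indeed, $q_{\mathcal D}$ and $\tilde q$ only differ on a $\mu_{\mathcal D}$-null set). Similarly, however, $\tilde q$ is also feasible for the full problem over $\mu$, and $\int \ell^*(\tilde q)\, d\mu_{\mathcal D} = \int \ell^*(\tilde q)\, d\mu$, implying that the inequality in Eq. (15) is an equality, and $\bar q$ and $\tilde q$ are optimal for both $\mu$ and $\mu_{\mathcal D}$. ■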

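Proof (of Corollary 3.3) Using the fact that the dual and thus also primal optimal values coincide for $\mu$ and $\mu_{\mathcal D}$, as well as the fact that $\ell \ge 0$, we obtain

$$\mathcal{E}(w; \mu_{\mathcal D}) = \int_{\mathcal D} \ell(Aw)\, d\mu - \inf_{v \in L_1(\mathcal{H})} \int_{\mathcal D} \ell(Av)\, d\mu \le \int \ell(Aw)\, d\mu - \inf_{v \in L_1(\mathcal{H})} \int \ell(Av)\, d\mu = \mathcal{E}(w)$$

directly, and similarly

$$\begin{aligned}
\mathcal{R}(w; \mu_{\mathcal{D}^c}) &= \int \ell(Aw)\, d\mu - \int_{\mathcal D} \ell(Aw)\, d\mu \\
&\le \int \ell(Aw)\, d\mu - \inf_{v \in L_1(\mathcal{H})} \int_{\mathcal D} \ell(Av)\, d\mu \\
&= \int \ell(Aw)\, d\mu - \inf_{v \in L_1(\mathcal{H})} \int \ell(Av)\, d\mu \\
&= \mathcal{E}(w).
\end{aligned}$$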


Appendix I. Proofs from Section 3.1 

Before proving Lemma 3.5 in full, we establish a general form of its first part. Unlike the proof of the second part of Lemma 3.5, the first part does not rely upon the structure of $\mathcal{D}^c$ in any way; indeed it is simply Markov's inequality.

Lemma I.1 Let finite measure $\mu$, hypotheses $\mathcal{H}$, loss $\ell \in \mathbb{L}$, and arbitrary set $C \subseteq \mathcal{Z}$ be given. Then for any $w \in L_1(\mathcal{H})$ and $r > 0$, the set $S_r := \{z \in C : \ell((Aw)(z)) \ge r\}$ satisfies $\mu(S_r) \le \mathcal{R}(w; \mu_C)/r$.

Proof Emulating the proof of Markov's inequality, every $z \in \mathcal{Z}$ satisfies

$$r\, \mathbb{1}[z \in S_r] \le r\, \mathbb{1}[\ell((Aw)(z)) \ge r] \le \ell((Aw)(z)),$$

thus integrating both sides along $C$ and dividing by $r$ gives

$$\mu(S_r) \le \frac{1}{r} \int_C \ell(Aw)\, d\mu = \frac{\mathcal{R}(w; \mu_C)}{r}.$$
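Lemma I.1 is just Markov's inequality applied to the loss values, which the following minimal sketch illustrates on a finite set. The representation of the measure restricted to $C$ as a list of `(mass, margin)` pairs, the function name `markov_check`, and the default exponential loss are all illustrative assumptions.

```python
import math


def markov_check(points, r, loss=math.exp):
    """Return (mu(S_r), R(w; mu_C) / r) for a finite set C.

    points: list of (mass, margin) pairs, where margin stands for (Aw)(z).
    S_r collects the points of C whose loss value is at least r.
    """
    risk_on_c = sum(m * loss(a) for m, a in points)
    mass_s_r = sum(m for m, a in points if loss(a) >= r)
    return mass_s_r, risk_on_c / r
```

For instance, `markov_check([(0.2, -2.0), (0.3, 0.5), (0.5, -0.1)], 1.0)` returns a first component no larger than the second, as the lemma predicts for any choice of masses, margins, and threshold $r > 0$.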


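Proof (of Lemma 3.5) Part (i) is proved by applying Lemma I.1 with $C := \mathcal{D}^c$, and then applying Corollary 3.3 for the inequality $\mathcal{R}(w; \mu_{\mathcal{D}^c}) \le \mathcal{E}(w)$.

For part (ii), first note that if $r \ge \ell(0)$, we are done, because $|\bar\eta - \eta_w| \le 1$. Now consider $r < \ell(0)$. Since $\bar\eta = 1$ for $\mu$-a.e. $z \in \mathcal{D}^c$ and $\eta_w \ge 0$ by definition, then

$$\int_{\mathcal{D}^c \setminus S_r} |\bar\eta - \eta_w|\, d\mu = \int_{\mathcal{D}^c \setminus S_r} (1 - \eta_w)\, d\mu = \int_{\mathcal{D}^c \setminus S_r} \frac{\ell'((Aw)(x, y))}{\ell'((Aw)(x, y)) + \ell'((Aw)(x, -y))}\, d\mu =: \diamondsuit,$$

thus it remains to control $\diamondsuit$. Since every $z \in \mathcal{D}^c \setminus S_r$ has $\ell((Aw)(z)) < r < \ell(0)$, the increasing property of $\ell$ implies $(Aw)(z) < 0$. Consequently, it follows that $\ell'((Aw)(z)) \le C_\ell\, \ell((Aw)(z)) \le C_\ell\, r$, and also that $\ell'((Aw)(x, -y)) = \ell'(-(Aw)(x, y)) \ge \ell'(0)$ since $\ell'$ is nondecreasing by convexity. Combining these bounds,

$$\diamondsuit = \int_{\mathcal{D}^c \setminus S_r} \frac{1}{1 + \ell'((Aw)(x, -y)) / \ell'((Aw)(x, y))}\, d\mu(x, y) \le \frac{\mu(\mathcal{D}^c \setminus S_r)}{1 + \ell'(0)/(C_\ell\, r)},$$

which gives the desired bound after rearrangement, noting that $C_\ell\, r > 0$.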


In order to prove Lemma 3.6, it will be necessary to establish an additional structural property of dual optima. In particular, recall the function $\bar f$, which is used in the proof of Lemma 3.6, and which is equal to $(\ell')^{-1}(\bar q)$ whenever $(\ell')^{-1}$ is defined for both $\bar q(x, y)$ and $\bar q(x, -y)$. It is this final condition (needing both $(x, y)$ and $(x, -y)$) which requires the extra work here.

For the purposes of Lemma 3.6, it will suffice to establish that $\mu$-a.e. $(x, y) \in \mathcal{D}$ satisfies $(x, -y) \in \mathcal{D}$, which is precisely the following lemma. This result is in fact a consequence of Lemma 3.5: the idea is that for those points with $(x, y) \in \mathcal{D}$ but $(x, -y) \in \mathcal{D}^c$, applying Lemma 3.5 grants that every low error predictor must achieve small error on this latter set. But this leads to a contradiction, since it necessitates that the error on the mirrored points, which reside in $\mathcal{D}$, must be large.

Lemma 1.2 Let finite measure y, hypotheses Ji, and loss £ £ be given. Then there exists a 
dual optimum $\bar q$ and corresponding difficult set $\mathcal{D}$ such that $\mu$-a.e. over $(x, y) \in \mathcal{D}$ we also have $(x, -y) \in \mathcal{D}$.

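Proof Let an arbitrary dual optimum $q_0$ be given as provided by Theorem 2.1, and let $\mathcal{D}_0$ denote the corresponding difficult set. If this provided $q_0$ already satisfies the necessary properties, the proof is done, therefore suppose it does not.

Define three sets

$$\begin{aligned}
K_0 &:= \{(x, y) \in \mathcal{D}_0 : (x, -y) \in \mathcal{D}_0^c,\ \eta_\mu(x, y) = 0\}, \\
K_1 &:= \{(x, y) \in \mathcal{D}_0 : (x, -y) \in \mathcal{D}_0^c,\ \eta_\mu(x, y) = 1\}, \\
K_+ &:= \{(x, y) \in \mathcal{D}_0 : (x, -y) \in \mathcal{D}_0^c,\ \eta_\mu(x, y) \in (0, 1)\},
\end{aligned}$$

and an adjusted dual optimum

$$\bar q(x, y) = \begin{cases}
0 & \text{when } (x, y) \in K_0, \\
\ell'(0) & \text{when } (x, -y) \in K_1, \\
q_0(x, y) & \text{otherwise}.
\end{cases}$$

Since $\mu(K_0) = 0 = \mu(\{(x, y) : (x, -y) \in K_1\})$ by construction, then $\bar q = q_0$ $\mu$-a.e., meaning $\bar q$ is also a dual optimum to Eq. (7). Defining $\mathcal{D} := \{z \in \mathcal{Z} : \bar q(z) > 0\}$, if $(x, y) \in \mathcal{D}$ and $(x, -y) \in \mathcal{D}^c$, then it must hold that $(x, y) \in K_+$. The proof is done if $\mu(K_+) = 0$; this will constitute the remainder of the proof.

Assume contradictorily that $\mu(K_+) > 0$. Define

$$U_\tau := \{(x, y) \in K_+ : \min\{\eta_\mu(x, y), \eta_\mu(x, -y)\} \ge \tau\}.$$

By continuity of measures, $\lim_{\tau \downarrow 0} \mu(U_\tau) = \mu(K_+) > 0$, thus there exists a fixed $\tau > 0$ so that $U := U_\tau$ has $\mu(U) \ge \tau$. For convenience, define $U_- := \{(x, -y) : (x, y) \in U\}$ (and use $S_-$ for this "flipped sign" transformation of any set $S \subseteq \mathcal{Z}$). By the conditions on $U$, then $\mu(U_-) \ge \tau \mu(U) \ge \tau^2$, and for any set $C \subseteq \mathcal{Z}$,

$$\mu(U \cap C_-) \ge \tau\, \mu(U_- \cap C). \qquad (16)$$

Now choose $\varepsilon_0 > 0$ so that $\ell(-\ell^{-1}(\sqrt{\varepsilon_0})) > 6\mathcal{R}(0)/\tau^3$, set $\varepsilon := \min\{\varepsilon_0, \tau^4/4, \mathcal{R}(0)\}$, and choose $w \in L_1(\mathcal{H})$ with $\mathcal{E}(w) \le \varepsilon$. Applying Lemma 3.5 to $w$ with $r := \sqrt{\varepsilon}$, the set

$$S_r := \{z \in \mathcal{D}^c : \ell((Aw)(z)) \ge r\}$$

satisfies $\mu(S_r) \le \varepsilon/r = r$. For convenience, define $V := \mathcal{D}^c \setminus S_r$, whereby $\mu(V) \ge \mu(\mathcal{D}^c) - r$, and every $z \in V$ has $\ell((Aw)(z)) < \sqrt{\varepsilon}$, which will be more useful in the form $(Aw)(z) < \ell^{-1}(\sqrt{\varepsilon})$. Furthermore, since $U_- \subseteq \mathcal{D}^c$,

$$\tau^2 \le \mu(U_-) = \mu(U_- \cap V) + \mu(U_- \cap V^c) \le \mu(U_- \cap V) + \mu(\mathcal{D}^c \cap V^c) = \mu(U_- \cap V) + \mu(S_r),$$

which rearranges to give $\mu(U_- \cap V) \ge \tau^2 - \mu(S_r) \ge \tau^2 - \sqrt{\varepsilon} \ge \tau^2/2$. Note by Eq. (16) that

$$\mu(U \cap V_-) \ge \tau\, \mu(U_- \cap V) \ge \tau^3/2,$$

and $z \in V_-$ has $(Aw)(z) > -\ell^{-1}(\sqrt{\varepsilon})$, and more importantly $\ell((Aw)(z)) > \ell(-\ell^{-1}(\sqrt{\varepsilon})) \ge 6\mathcal{R}(0)/\tau^3$. Consequently, since $\mathcal{E}(w) \le \mathcal{R}(0)$ and $\mathcal{R}(0) > 0$,

$$\begin{aligned}
\mathcal{R}(w) &\ge \int_{U \cap V_-} \ell(Aw)\, d\mu(x, y) \\
&\ge \mu(U \cap V_-)\, \ell(-\ell^{-1}(\sqrt{\varepsilon})) \\
&\ge 3\mathcal{R}(0) \\
&> \mathcal{R}(0) + \mathcal{E}(w) + \inf_{v \in L_1(\mathcal{H})} \mathcal{R}(v) \\
&= \mathcal{R}(0) + \mathcal{R}(w) \\
&> \mathcal{R}(w),
\end{aligned}$$

a contradiction.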


Proof (of Lemma 3.6) First consider $S_+$; by $\ell \ge 0$ and convexity, $\ell(c_1) \ge \ell(0) + c_1 \ell'(0) \ge c_1 \ell'(0)$, and

$$\mathcal{R}(w) \ge \int_{S_+} \ell(Aw)\, d\mu \ge \ell(c_1)\, \mu(S_+) \ge c_1\, \ell'(0)\, \mu(S_+),$$

which rearranges to give $\mu(S_+) \le \mathcal{R}(w)/(c_1 \ell'(0))$.

To control $S_-$ we take advantage of $S_+$: the region $\mathcal{D}$ is a set of points where it is impossible for $S_-$ to be large without $S_+$ being large as well, and $\bar q$ is a witness to this fact. To start, note by $A^\top \bar q = 0$ and $\bar q = 0$ on $\mathcal{D}^c$ that

$$0 = \langle Aw, \bar q \rangle = \int_{\mathcal D} (Aw)\, \bar q\, d\mu = \int_{Aw > 0} (Aw)\, \bar q\, d\mu + \int_{Aw < 0} (Aw)\, \bar q\, d\mu,$$

which rearranges to yield

$$\int_{Aw > 0} (Aw)\, \bar q\, d\mu = -\int_{Aw < 0} (Aw)\, \bar q\, d\mu.$$

Combining this with Hölder's inequality for Orlicz spaces (see Proposition B.1),

$$\begin{aligned}
2\, \|\bar q\|_{\beta^*}\, \|Aw\, \mathbb{1}[Aw > 0]\|_\beta &\ge |\langle \bar q, Aw\, \mathbb{1}[Aw > 0] \rangle| \\
&= |\langle \bar q, Aw\, \mathbb{1}[Aw < 0] \rangle| \\
&\ge c_1 c_2\, \mu(S_-).
\end{aligned}$$


Now using the definition of and rearranging. 


> 


> Aw>{) 


£{Aw)dfj, > 


CiC2^(5-) 


which gives the desired bound on $\mu(S_-)$.

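In order to control $|\bar\eta - \eta_w|$ on $U$, suppose without loss of generality that $\mu$-a.e. $(x, y) \in \mathcal{D}$ satisfies $(x, -y) \in \mathcal{D}$ (see Lemma I.2), and define a scalar $L := \lim_{r \to \infty} \ell'(r)$, a set $\mathcal{D}' := \{z \in \mathcal{D} : \bar q(z) < L\}$, and a function

$$\bar f(z) = \begin{cases}
(\ell^*)'(\bar q(z)) & \text{when } z \in \mathcal{D}', \\
0 & \text{otherwise}.
\end{cases}$$

Note that $\bar f$ is well-defined (and measurable) by construction, since strict convexity of $\ell$ implies differentiability of $\ell^*$ along the interior of $\mathrm{dom}(\ell^*)$ (Hiriart-Urruty and Lemaréchal, 2001, Theorem E.4.1.1), which coincides with the set $\mathcal{D}'$ (because the domain of $(\ell^*)'$ is the image of $\ell'$ by first-order optimality for conjugates). By Taylor's theorem, for every $z \in \mathcal{D}'$ there exists $q_z \in [(Aw)(z), \bar f(z)]$ with

$$\begin{aligned}
\ell((Aw)(z)) &= \ell(\bar f(z)) + \ell'(\bar f(z))\big((Aw)(z) - \bar f(z)\big) + \tfrac{1}{2}\big((Aw)(z) - \bar f(z)\big)^2 \ell''(q_z) \\
&= -\ell^*(\bar q(z)) + \bar q(z)\,(Aw)(z) + \tfrac{1}{2}\big((Aw)(z) - \bar f(z)\big)^2 \ell''(q_z) \\
&\ge -\ell^*(\bar q(z)) + \bar q(z)\,(Aw)(z) + \tfrac{\tau}{2}\big((Aw)(z) - \bar f(z)\big)^2\, \mathbb{1}[z \in U],
\end{aligned}$$

where the second line made use of $\bar q(z) = \ell'(\bar f(z))$ and Fenchel's inequality. All terms in this final bound are integrable over $\mathcal{D}'$, and moreover either $\mathcal{D} = \mathcal{D}'$, or $L < \infty$ and $\mu(\mathcal{D} \setminus \mathcal{D}') = 0$ by Lemma G.2, thus applying $\int_{\mathcal D}$ to both sides gives

$$\begin{aligned}
\int_{\mathcal D} \ell(Aw)\, d\mu &\ge -\int_{\mathcal D} \ell^*(\bar q)\, d\mu + \langle \bar q, Aw \rangle + \frac{\tau}{2} \int_U (Aw - \bar f)^2\, d\mu \\
&= \inf_{v \in L_1(\mathcal{H})} \int_{\mathcal D} \ell(Av)\, d\mu + \frac{\tau}{2} \int_U (Aw - \bar f)^2\, d\mu,
\end{aligned}$$

which made use of $A^\top \bar q = 0$ and the fact that $\bar q$ also maximizes the dual problem restricted to $\mu_{\mathcal D}$ (by Lemma 3.2). Rearranging the preceding Taylor expansion gives

$$\int_U (Aw - \bar f)^2\, d\mu \le \frac{2}{\tau}\, \mathcal{E}(w; \mu_{\mathcal D}). \qquad (17)$$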


The next step is to convert between $\bar f$ and $\bar\eta$. To this end, recall from the construction of $\bar f$ and subsequent discussion that $\bar f(z) = (\ell^*)'(\bar q(z))$ for $\mu$-a.e. $z \in \mathcal{D}$ (and $\mu$-a.e. $(x, y) \in \mathcal{D}$ has $(x, -y) \in \mathcal{D}$), thus Theorem 2.1.iv grants

$$\bar f(x, y) = (\ell^*)'(\bar q(x, y)) = -(\ell^*)'(\bar q(x, -y)) = -\bar f(x, -y) \quad \text{for } \mu\text{-a.e. } (x, y) \in \mathcal{D}.$$

In particular, this grants $\phi(-\bar f(z)) = \bar\eta(z)$ for $\mu$-a.e. $z \in \mathcal{D}$, which combined with Eq. (17) and the notation $L_\phi$ for the Lipschitz constant of $\phi$ means



where the penultimate step used Jensen’s inequality. 


Proof (of Theorem 1.1) First note that the bound for a single $w \in L_1(\mathcal{H})$ immediately implies the convergence result, thus it suffices to prove the bound.

To this end, let $w \in L_1(\mathcal{H})$ be given, set $\varepsilon := \mathcal{E}(w)$, and before defining $f$ (which will not depend on $w$), define two helper functions:

$$\Gamma(r) = \inf_{|z| \le r} \ell''(z), \qquad
\rho_\varepsilon := \begin{cases}
\min\{r \ge 0 : \Gamma(r) \le 2\sqrt{\varepsilon}\} & \text{if } 2\sqrt{\varepsilon} \le \Gamma(1), \\
1 & \text{if } 2\sqrt{\varepsilon} > \Gamma(1).
\end{cases}$$

The key properties are that $\Gamma(r) > 0$, it is continuous, non-increasing, and $\lim_{r \to \infty} \Gamma(r) = 0$ because $\liminf_{r \to -\infty} \ell''(r) = 0$ (by Lemma E.1). On the other hand, the definition of $\rho_\varepsilon$ implies that

$$\Gamma(\rho_\varepsilon) = \min\{2\sqrt{\varepsilon}, \Gamma(1)\},$$

which means that $\rho_\varepsilon \to \infty$ as $\varepsilon \downarrow 0$.

Next, $f$ will be constructed by splitting $\|\bar\eta - \eta_w\|_1$ along $\mathcal{D}$ and $\mathcal{D}^c$, and subsequently using Lemma 3.6 and Lemma 3.5 to control each term. When applying Lemma 3.5, the bound may be simplified by using $r := \sqrt{\varepsilon}$ and $\mu(\mathcal{D}^c \setminus S_r) \le 1$. When applying Lemma 3.6 (and using Corollary 3.3 to give $\mathcal{E}(w; \mu_{\mathcal D}) \le \varepsilon$), the bound may be simplified by setting $c_1 := \rho_\varepsilon$, $c_2 := \max\{\ell'(-\rho_\varepsilon), \rho_\varepsilon^{-1/2}\}$, and $c_3 := \ell'(\rho_\varepsilon)$. With these definitions, it follows that the $\tau$ of Lemma 3.6 (a minimum of two infima of $\ell''$, one over $|z| \le c_1$ and one over a range determined by $[c_2, c_3]$) coincides with $\Gamma(\rho_\varepsilon)$. If $c_3 < c_2$, set $f(\varepsilon) = 1$; otherwise, Lemma 3.6 may be applied, and together with the terms from Lemma 3.5 it follows that


j\g-gy,\dfi 



\g - gw\dg + 



\g - gw\dg 


< 'Je + Ve max 


1 Q 1 

£(0)’ f(0) J 


+ ( e + inf 
V v&Li(^) 


i{Av)dg 


( 1 

Ue^'(O) J 



-k 


+ g{{z E Z : q{z) E {0,mayi{i'{-g^), g^ V q{z) > {ge)}) 


=■■ /i(e)- 


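By construction, $f$ is well-defined, does not depend on $w$, and satisfies the desired inequality; it remains to be shown that $f(\varepsilon) \to 0$ as $\varepsilon \downarrow 0$. It suffices to consider $\triangle$ and $\star$, since all other terms contain $\varepsilon$ in a numerator, or $\rho_\varepsilon$ in a denominator (where, as shown before, $\rho_\varepsilon \to \infty$ as $\varepsilon \downarrow 0$), without any worry of cancellations mitigating these effects.

To handle $\triangle$, first expand the terms as

$$\triangle \le \underbrace{\mu\big(\{z \in \mathcal{Z} : \bar q(z) \in (0, \max\{\ell'(-\rho_\varepsilon), \rho_\varepsilon^{-1/2}\})\}\big)}_{\square} + \underbrace{\mu\big(\{z \in \mathcal{Z} : \bar q(z) > \ell'(\rho_\varepsilon)\}\big)}_{\bigcirc}.$$

$\square \to 0$ as $\varepsilon \downarrow 0$, since $\rho_\varepsilon^{-1/2} \to 0$ as $\varepsilon \downarrow 0$ and since $\ell'(-r) \to 0$ as $r \to \infty$ by Lemma E.1. Lastly, to show $\bigcirc \to 0$, there are two cases. First, if $\ell'$ grows unboundedly, then $\ell'(\rho_\varepsilon)$ will cover all values as $\varepsilon \downarrow 0$. On the other hand, if $L := \lim_{r \to \infty} \ell'(r) < \infty$, then $\mu(\{z \in \mathcal{Z} : \bar q \ge L\}) = 0$ as provided by Lemma G.2 means once again that $\ell'(\rho_\varepsilon)$ will cover all values ($\mu$-a.e.) as $\varepsilon \downarrow 0$.

Lastly, to handle $\star$, we use $\Gamma(\rho_\varepsilon) = \min\{2\sqrt{\varepsilon}, \Gamma(1)\}$ to obtain that

$$L_\phi \sqrt{2\varepsilon / \Gamma(\rho_\varepsilon)} = \max\big\{ L_\phi \sqrt{2\varepsilon / \Gamma(1)},\ L_\phi\, \varepsilon^{1/4} \big\},$$

which goes to zero as $\varepsilon \to 0$. ■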
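The arithmetic behind this last identity is short enough to spell out. The symbols below are a reading of the garbled source: $\Gamma$ is taken to be the helper function built from the second derivative of the loss and $\rho_\varepsilon$ the radius defined from it, with $\Gamma(\rho_\varepsilon) = \min\{2\sqrt{\varepsilon}, \Gamma(1)\}$ assumed as stated in the proof.

```latex
\frac{2\varepsilon}{\Gamma(\rho_\varepsilon)}
  = \frac{2\varepsilon}{\min\{2\sqrt{\varepsilon},\, \Gamma(1)\}}
  = \max\left\{ \frac{2\varepsilon}{2\sqrt{\varepsilon}},\ \frac{2\varepsilon}{\Gamma(1)} \right\}
  = \max\left\{ \sqrt{\varepsilon},\ \frac{2\varepsilon}{\Gamma(1)} \right\},
\qquad\text{hence}\qquad
\sqrt{\frac{2\varepsilon}{\Gamma(\rho_\varepsilon)}}
  = \max\left\{ \varepsilon^{1/4},\ \sqrt{\frac{2\varepsilon}{\Gamma(1)}} \right\}.
```

Both entries of the maximum tend to zero with $\varepsilon$, and multiplying by a fixed Lipschitz constant does not change this.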
Appendix J. Proofs from Section 3.2

As in the main text, this appendix first develops the quantity Bal($\mu$), and then uses it to develop the deviation bounds.

J.1. Basic properties of Bal($\mu$)

To start, note the range of values for Bal($\mu$). When $\mathrm{Ker}(\mu)^\perp \neq \{0\}$, then $\mathrm{Bal}(\mu) \in [0, \mu(\mathcal{Z})]$, but the case $\mathrm{Ker}(\mu)^\perp = \{0\}$ means, via usual conventions on infima, that $\mathrm{Bal}(\mu) = \infty$. This represents a certain degeneracy in the learning problem; indeed, it is a scenario where there is nothing to learn, since equivalently $\mathrm{Ker}(\mu) = \mathbb{R}^d$ and thus every element of $\mathbb{R}^d$ has no impact on the problem.

With this in mind, the first lemma relates boundedness and risk.

Lemma J.1 Let finite measure $\mu$, hypotheses $\mathcal{H}$, loss $\ell \in \mathbb{L}$, and $s \in \partial\ell(0)$ be given. Then every $w \in \mathrm{Ker}(\mu)^\perp$ satisfies

$$\|w\|_1 \le \frac{\mathcal{R}(w)}{s\, \mathrm{Bal}(\mu)}.$$

Proof By definition of Bal($\mu$), since $\ell \ge 0$,

$$\mathcal{R}(w) \ge \int_{Aw > 0} \ell(Aw)\, d\mu \ge \int_{Aw > 0} \big(\ell(0) + s\,(Aw)\big)\, d\mu \ge \|w\|_1\, s\, \mathrm{Bal}(\mu),$$

which rearranges to give the result. (As a sanity check, the case $\mathrm{Bal}(\mu) = \infty$ means $\mathrm{Ker}(\mu)^\perp = \{0\}$, whereby $\|w\|_1 = 0$ automatically.) ■

Next, note that the infimand within the definition of Bal($\mu$) is Lipschitz continuous.

Lemma J.2 Let finite measure $\mu$ and hypotheses $\mathcal{H}$, $|\mathcal{H}| = d$, be given, and define the function $f(w) := \int_{Aw > 0} (Aw)\, d\mu$. Then, for every $w, w' \in \mathbb{R}^d$,

$$|f(w) - f(w')| \le \|w - w'\|_1\, \mu(\mathcal{Z}).$$

Proof Let $w, w'$ be given, and define $N(v) := \{z \in \mathcal{Z} : (Av)(z) > 0\}$. Since $|h(z)| \le 1$ for every $h \in \mathcal{H}$ (whereby $|(Av)(z)| \le \|v\|_1$ for every $z$ and $v$),

$$\begin{aligned}
|f(w) - f(w')| &= \left| \int_{N(w) \cap N(w')} A(w - w')\, d\mu + \int_{N(w) \setminus N(w')} (Aw)\, d\mu - \int_{N(w') \setminus N(w)} (Aw')\, d\mu \right| \\
&\le \|w - w'\|_1\, \mu(N(w) \cap N(w')) + \left| \int_{N(w) \setminus N(w')} (Aw)\, d\mu \right| + \left| \int_{N(w') \setminus N(w)} (Aw')\, d\mu \right|.
\end{aligned}$$

Since the second and third terms are symmetric, it suffices to consider the second. To this end, note that

$$z \in N(w) \setminus N(w') \implies (Aw)(z) > 0$$

and

$$z \in N(w) \setminus N(w') \implies (Aw)(z) = (Aw')(z) + (A(w - w'))(z) \le 0 + \|w - w'\|_1,$$

which combine to give

$$z \in N(w) \setminus N(w') \implies |(Aw)(z)| \le \|w - w'\|_1,$$

and thus

$$\left| \int_{N(w) \setminus N(w')} (Aw)\, d\mu \right| \le \int_{N(w) \setminus N(w')} |Aw|\, d\mu \le \|w - w'\|_1\, \mu(N(w) \setminus N(w')).$$

The result follows. ■
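The Lipschitz bound of Lemma J.2 can be probed on a random finite instance. The sketch below assumes a finitely supported measure, features bounded by 1 in absolute value, and the sign convention $(Aw)(x, y) = -y \sum_j w_j h_j(x)$; that convention, the sampling scheme, and the function names are illustrative assumptions rather than definitions taken from the text.

```python
import random


def positive_part_integral(w, feats, labels, masses):
    """f(w) = sum over z of mass(z) * max(0, (Aw)(z)),
    with (Aw)(x, y) = -y * sum_j w_j * h_j(x) (assumed sign convention)."""
    total = 0.0
    for h, y, m in zip(feats, labels, masses):
        a = -y * sum(wj * hj for wj, hj in zip(w, h))
        if a > 0:
            total += m * a
    return total


def lipschitz_violation(trials=200, d=3, n=20, seed=0):
    """Largest value of |f(w) - f(w')| - ||w - w'||_1 * mu(Z) over random pairs;
    Lemma J.2 predicts this is never positive (up to rounding)."""
    rng = random.Random(seed)
    feats = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
    labels = [rng.choice((-1, 1)) for _ in range(n)]
    masses = [rng.random() for _ in range(n)]
    total_mass = sum(masses)
    worst = float("-inf")
    for _ in range(trials):
        w = [rng.gauss(0.0, 1.0) for _ in range(d)]
        w2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        gap = abs(positive_part_integral(w, feats, labels, masses)
                  - positive_part_integral(w2, feats, labels, masses))
        l1 = sum(abs(u - v) for u, v in zip(w, w2))
        worst = max(worst, gap - l1 * total_mass)
    return worst
```

A random search of this kind cannot prove the lemma, but a positive return value would flag a mistake in either the bound or the assumed conventions.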


It will now be shown that $\mathrm{Bal}(\mu_{\mathcal D}) > 0$ whenever $\mu(\mathcal{D}) > 0$. To prove this, the preceding lemma showed that the infimand in the definition of Bal($\mu$) is continuous; on the other hand, since $|\mathcal{H}| < \infty$, the domain of the infimum is compact, which together with the aforementioned continuity gives attainment at a necessarily positive point.


Lemma J.3 Let finite measure $\mu$, hypotheses $\mathcal{H}$, loss $\ell \in \mathbb{L}$, and dual variable $q \in L^{\beta^*}(\mu)$ with $q \ge 0$ $\mu$-a.e. and $A^\top q = 0$ be given, and set $D := \{z : q(z) > 0\}$. If $\mu(D) > 0$ and $|\mathcal{H}| < \infty$, then $\mathrm{Bal}(\mu_D) > 0$.

Proof If $\mathrm{Ker}(\mu_D)^\perp = \{0\}$, then $\mathrm{Bal}(\mu_D) = \infty > 0$ immediately, thus suppose $\mathrm{Ker}(\mu_D)^\perp$ is a nontrivial subspace, meaning in particular that there exists $w \in \mathrm{Ker}(\mu_D)^\perp$ with $\|w\|_1 = 1$. By Lemma J.2, the map $w \mapsto \int_{Aw > 0} (Aw)\, d\mu_D$ is continuous; since moreover the (nonempty) set $C := \{w \in \mathrm{Ker}(\mu_D)^\perp : \|w\|_1 = 1\}$ is compact when $|\mathcal{H}| < \infty$, it follows that the minimization in the definition of $\mathrm{Bal}(\mu_D)$ is attained at some point in $C$. The remainder of the proof establishes that the integral is indeed positive everywhere on $C$.

Consider any $w \in C$. Since $C \cap \mathrm{Ker}(\mu_D) = \emptyset$, it must hold that $\mu_D(\{z \in \mathcal{Z} : (Aw)(z) \neq 0\}) > 0$ (else $w \in \mathrm{Ker}(\mu_D)$), and thus at least one of the two expressions $\int_{Aw > 0} (Aw)\, d\mu_D$ and $\int_{Aw < 0} (Aw)\, d\mu_D$ must be nonzero. If the first is nonzero, it is positive, and the proof is done, thus suppose that only the second expression is nonzero, which necessarily means it is negative. Since $A^\top q = 0$, then $\langle Aw, q \rangle = 0$, which can be split into negative and positive parts to yield

$$\int_{Aw > 0} (Aw)\, q\, d\mu_D = -\int_{Aw < 0} (Aw)\, q\, d\mu_D > 0$$

as desired. ■


Lemma J.3 was stated for general $q \in L^{\beta^*}(\mu)$ due to its use in future lemmas; however, by instantiating it for a dual optimum $\bar q$, it follows that $\mathrm{Bal}(\mu_{\mathcal D}) > 0$ whenever $\mu(\mathcal{D}) > 0$.

Proposition J.4 Let $\mu$ be a finite measure, $\mathcal{H}$ be a hypothesis set with $|\mathcal{H}| < \infty$, and $\ell \in \mathbb{L}$ be a loss with corresponding difficult set $\mathcal{D}$. Then $\mathrm{Bal}(\mu_{\mathcal D}) > 0$ whenever $\mu(\mathcal{D}) > 0$.

Proof The result follows by applying Lemma J.3 to $\bar q$ and $\mathcal{D}$, noting that they satisfy the desired properties by Theorem 2.1 and the definition of $\mathcal{D}$. ■


The next two properties will establish the interplay between Bal, $\mathcal{D}$, and also primal-dual optimal pairs $(\bar w, \bar q)$.
Lemma J.5 Let finite measure $\mu$, hypotheses $\mathcal{H}$ with $|\mathcal{H}| < \infty$, loss $\ell \in \mathbb{L}$, and $s \in \partial\ell(0)$ be given. If $\mathrm{Bal}(\mu) > 0$, then there exists a primal-dual optimal pair $(\bar w, \bar q)$ to Eq. (7) which satisfies $\bar w \in \mathrm{Ker}(\mu)^\perp$, and $\|\bar w\|_1 \le \ell(0)\mu(\mathcal{Z})/(s\,\mathrm{Bal}(\mu))$, and $\bar q(z) \in \partial\ell((A\bar w)(z))$ for $\mu$-a.e. $z \in \mathcal{Z}$.

Proof From the definition of $\mathrm{Ker}(\mu)$, it suffices to optimize the primal over $\mathrm{Ker}(\mu)^\perp$, and by Lemma E.1, the primal optimization can be further restricted to the compact convex set

$$\big\{ w \in \mathrm{Ker}(\mu)^\perp : \|w\|_1 \le \ell(0)\mu(\mathcal{Z})/(s\,\mathrm{Bal}(\mu)) \big\},$$

where a minimum $\bar w$ is attained (by continuity of convex functions on $\mathbb{R}^d$). The relationship with $\bar q$ follows by Theorem 2.1. ■


The remainder of this subsection will build towards the construction of the canonical difficult 
set 2)*: the difficult sets 2? provided by losses in are “maximal” in the measure-theoretic sense. 
To this end, the following lemma is essential. 

Lemma J.6 Let finite measure p, hypotheses 2f with \!K\ < oo, loss £ G and a corresponding 
difficult set D be given. For any set S with Bal{ps) > 0, then p(S \ 2)) = 0. 

Proof Suppose contradictorily that p{S \ 2)) > 0, which entails p{S) > 0, and let q denote the dual 
optimum associated with 2). 

Applying Lemma J.5 to loss £ and measure ps^ it follows from Bal(/i 5 ) > 0 that there ex¬ 
ists a primal optimum ws and corresponding dual optimum qs with qs G d£{Aws) and 

consequently ^5 > 0 p-a.e.. since £ G 11^“'“. 

Define q{z) := q{z) + qs{z)l[z G S], whereby, for any w G 

{Aw,q) = {Aw,q) -\- {Aw,qsl[z e S]) = J {Aw){q)dp J {Aw)qsdp = 0-\-0. 

Additionally, q > 0 p-a.e. with q > 0 p-a.e. along D := T)VJ S, thus Lemma J.3 and Lemma J.5 
may be applied to obtain a dual optimum qjo which is positive p-a.e. along D. Of course, q was 
feasible for the problem restricted to D, and by strict convexity of f £*{q)dp (see Proposition A.3), 
it follows that f £*{qD)dpD < / £*{q)dpD- But this is a contradiction, since z 1 —)• qD{z)l[z G D] 
is feasible for the full problem without changing its objective value, and meanwhile q was optimal 
for the full problem. ■ 


Proof (of Proposition 3.9) We will show that if ^1 G L and £2 G with corresponding difficult 
sets 2)i and 2 ) 2 , then p{Vi \ 2 ) 2 ) = 0, which will yield the proof. For (i), it suffices to instantiate 
the claim with £i = £ and £2 = exp, and (ii) follows by instantiating the claim once with £i = £ 
and £2 = exp, and a second time with £i = exp and £2 = £. 

The proof of the general claim is as follows. If /r(2)i) = 0, then p{'Di \ 2 ) 2 ) = 0 automati¬ 
cally, thus suppose /r(2)i) > 0. In this case. Lemma J.3 grants Bal(^D J > 0, and thus applying 
Lemma J.6 with loss £2 and S := 2)i gives p{‘Di \ 2 ) 2 ) = 0. ■ 
J.2. Splitting $\hat{\mathcal R}$ along $\mathcal{D}_*$ and $\mathcal{D}_*^c$

As granted by the development of Bal($\mu$), recall from the main text that there exists a canonical difficult set $\mathcal{D}_*$, which by Proposition 3.9 is not tied to any specific loss. The goal of this section is to show, as stated in Lemma J.8, that $\hat{\mathcal R}$ can be split along $\mathcal{D}_*$, just like $\mathcal{R}$ (cf. Corollary 3.3), despite $\mathcal{D}_*$ being constructed over $\mu$ rather than $\hat\mu$.

As the first step, we establish the existence of arbitrarily good predictors over $\mathcal{D}_*^c$.

Lemma J.7 Let finite measure $\mu$, hypotheses $\mathcal{H}$ with $|\mathcal{H}| = d$, and canonical difficult set $\mathcal{D}_*$ be given. Then for every $\varepsilon > 0$, there exists $v \in \mathbb{R}^d$ such that $(Av)(z) = 0$ for $\mu$-a.e. $z \in \mathcal{D}_*$, and $\mu(\{z \in \mathcal{D}_*^c : (Av)(z) \ge -1\}) \le \varepsilon$.

Proof Throughouf fhis proof, sef £ := exp G whereby 2) = 2)* by definifion of 2?*, and lef 
e > 0 be given. 

There are now fwo cases fo consider; firsf consider fhe simpler case /r(2)*) = 0. Choose any 
V G Li{p) wifh L{v) < and firsf note fhaf {Av){z) = 0 for /x-a.e. z G 2)* wifhouf any 

efforf since /r(2)*) = 0. On fhe ofher hand, by Lemma 3.5 (wifh r := £(—!)), 

MU e n : (.Av){z) > - 1 )) = MU £ n : t((Av)U)) > <(-!))) < 

which completes the proof under the assumption $\mu(\mathcal D_*) = 0$.

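Now consider the case $\mu(\mathcal D_*) > 0$, whereby Proposition J.4 grants $\mathrm{Bal}(\mu_{\mathcal D_*}) > 0$. Let $s \in \partial\ell(0)$
be arbitrary and set

$$ B := \frac{1 + \inf_{w \in \mathbb R^d} \mathcal R(w; \mu_{\mathcal D_*})}{s\,\mathrm{Bal}(\mu_{\mathcal D_*})}, $$

whereby Lemma J.1 grants that every $w \in \mathrm{Ker}(\mu_{\mathcal D_*})^\perp$ with $\mathcal E(w; \mu_{\mathcal D_*}) \le 1$ satisfies

$$ \|w\|_1 \le \frac{\mathcal R(w; \mu_{\mathcal D_*})}{s\,\mathrm{Bal}(\mu_{\mathcal D_*})} \le B. $$

Now let $\epsilon > 0$ be given, and choose $\epsilon_0 \in (0, \min\{\epsilon^2, 1\}]$ such that $\ell^{-1}(\sqrt{\epsilon_0}) \le -1 - B$. Let
$u \in \mathbb R^d$ be given with $\mathcal E(u) \le \epsilon_0$, whereby Corollary 3.3 grants that
$\max\{\mathcal E(u; \mu_{\mathcal D_*}), \mathcal R(u; \mu_{\mathcal D_*^c})\} \le \epsilon_0$ as well. By Lemma 3.5 with $r := \sqrt{\epsilon_0}$ and the above definitions,

$$
\begin{aligned}
\epsilon &\ge \mu(\{z \in \mathcal D_*^c : \ell((Au)(z)) > \sqrt{\epsilon_0}\}) \\
&= \mu(\{z \in \mathcal D_*^c : (Au)(z) > \ell^{-1}(\sqrt{\epsilon_0})\}) \\
&\ge \mu(\{z \in \mathcal D_*^c : (Au)(z) > -1 - B\}).
\end{aligned}
$$

Now write $u$ as the direct sum $u = v \oplus u_\perp$, where $v \in \mathrm{Ker}(\mu_{\mathcal D_*})$ and $u_\perp \in \mathrm{Ker}(\mu_{\mathcal D_*})^\perp$. By the
earlier derivation, $\|u_\perp\|_1 \le B$, and thus, for any $z \in \mathcal Z$, we have $|(Au_\perp)(z)| \le B$, and

$$ (Av)(z) > -1 \implies (Av)(z) > -1 - B - (Au_\perp)(z) \iff (Au)(z) > -1 - B. $$

This combines with the earlier derivation to yield

$$ \epsilon \ge \mu(\{z \in \mathcal D_*^c : (Au)(z) > -1 - B\}) \ge \mu(\{z \in \mathcal D_*^c : (Av)(z) > -1\}) $$

as desired.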


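Thanks to the preceding lemma, splitting $\mathcal R_n$ along $\mathcal D_*$ and $\mathcal D_*^c$ is straightforward (and similar to
the proof of Corollary 3.3).

Lemma J.8 Let probability measure $\mu$, hypotheses $\mathcal H$ with $|\mathcal H| = d$, and canonical difficult set $\mathcal D_*$
be given. Then, for any $\delta > 0$, with probability $1 - \delta$ over a random draw of size $n$ from $\mu$, every
admissible loss $\ell$ satisfies

$$ \inf_{w \in \mathbb R^d} \mathcal R(w; \hat\mu) = \inf_{w \in \mathbb R^d} \mathcal R(w; \hat\mu_{\mathcal D_*}), $$

and for every $w \in \mathbb R^d$,

$$ \mathcal E(w; \hat\mu_{\mathcal D_*}) \le \mathcal E(w; \hat\mu), \qquad \mathcal R(w; \hat\mu_{\mathcal D_*^c}) \le \mathcal E(w; \hat\mu). $$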

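Proof Let $S$ denote the sample, where $S_c := S \cap \mathcal D_*^c$ with size $n_c := |S_c|$ denotes the portion
falling within $\mathcal D_*^c$, and $S_{\mathcal D} := S \cap \mathcal D_*$ the portion falling within $\mathcal D_*$. If $n_c = 0$, then all claims follow
immediately (indeed, this implies $\hat\mu = \hat\mu_{\mathcal D_*}$, and $\mathcal R(w; \hat\mu_{\mathcal D_*^c}) = 0$), thus suppose $n_c > 0$.

By Lemma J.7 with $\epsilon := \mu(\mathcal D_*^c)\epsilon_0$ and $\epsilon_0 := \min\{1/2, -\ln(1 - \delta)/(2n_c)\}$, there exists $v \in \mathbb R^d$
satisfying $(Av)(z) = 0$ for $\mu$-a.e. $z \in \mathcal D_*$, and

$$ \mu(\{z \in \mathcal D_*^c : (Av)(z) > -1\}) < \epsilon = \mu(\mathcal D_*^c)\epsilon_0, $$

or equivalently

$$ \mu|_{\mathcal D_*^c}(\{z \in \mathcal D_*^c : (Av)(z) > -1\}) < \epsilon_0. $$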

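Consequently, with probability at least $1 - \delta$ over the draw of $S$, conditional on $n_c$, we obtain
$(Av)(z_i) = 0$ for every $z_i \in S_{\mathcal D}$ and $(Av)(z_i) \le -1$ for every $z_i \in S_c$, the latter statement since

$$
\begin{aligned}
\Pr[\forall z_i \in S_c,\ (Av)(z_i) \le -1] &= \mu|_{\mathcal D_*^c}(\{z \in \mathcal D_*^c : (Av)(z) \le -1\})^{n_c} \\
&\ge (1 - \epsilon_0)^{n_c} \\
&\ge \left(1 - (2\epsilon_0) + (2\epsilon_0)^2/2\right)^{n_c} \\
&\ge \exp(-2 n_c \epsilon_0) \\
&\ge 1 - \delta.
\end{aligned}
$$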
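The chain of inequalities above can be sanity-checked numerically. The following is a minimal sketch, not part of the original argument: the function names are hypothetical, and it only verifies, for a few values of $n_c$ and $\delta$, that the choice $\epsilon_0 = \min\{1/2, -\ln(1-\delta)/(2n_c)\}$ makes each term dominate the next.

```python
import math

def eps0(n_c: int, delta: float) -> float:
    """eps_0 := min{1/2, -ln(1 - delta) / (2 n_c)}, the choice made in the proof."""
    return min(0.5, -math.log(1.0 - delta) / (2.0 * n_c))

def chain(n_c: int, delta: float) -> list:
    """Return the four quantities of the chain, in the order they appear."""
    e = eps0(n_c, delta)
    return [
        (1.0 - e) ** n_c,                               # (1 - eps_0)^{n_c}
        (1.0 - 2.0 * e + (2.0 * e) ** 2 / 2.0) ** n_c,  # uses eps_0 <= 1/2
        math.exp(-2.0 * n_c * e),                       # uses 1 - x + x^2/2 >= e^{-x}
        1.0 - delta,                                    # uses eps_0 <= -ln(1-delta)/(2 n_c)
    ]

for n_c in (1, 5, 100, 10_000):
    for delta in (0.01, 0.1, 0.5, 0.9):
        terms = chain(n_c, delta)
        # Each term should be at least the next one (up to rounding).
        assert all(a >= b - 1e-12 for a, b in zip(terms, terms[1:]))
```

The second step is where the cap $\epsilon_0 \le 1/2$ is used, and the last step is where the dependence on $\delta$ enters.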


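Since $\lim_{z \to -\infty} \ell(z) = 0$, every $w \in \mathbb R^d$ and $z_i \in S_c$ satisfies $\inf_{r > 0} \ell((A(w + rv))(z_i)) = 0$, thus

$$
\begin{aligned}
\inf_{w \in \mathbb R^d} \mathcal R(w; \hat\mu)
&= \inf_{\substack{w \in \mathbb R^d \\ r > 0}} \int \ell(A(w + rv))\,d\hat\mu \\
&= \inf_{\substack{w \in \mathbb R^d \\ r > 0}} \left( \int_{\mathcal D_*} \ell(A(w + rv))\,d\hat\mu + \int_{\mathcal D_*^c} \ell(A(w + rv))\,d\hat\mu \right) \\
&= \inf_{w \in \mathbb R^d} \left( \int_{\mathcal D_*} \ell(Aw)\,d\hat\mu + \inf_{r > 0} \int_{\mathcal D_*^c} \ell(A(w + rv))\,d\hat\mu \right) \\
&= \inf_{w \in \mathbb R^d} \mathcal R(w; \hat\mu_{\mathcal D_*}).
\end{aligned}
$$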
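For the last part, proceeding similarly to the proof of Lemma 3.2, the above derivation and $\ell \ge 0$
grant

$$ \mathcal E(w; \hat\mu_{\mathcal D_*}) = \int_{\mathcal D_*} \ell(Aw)\,d\hat\mu - \inf_{v \in \mathbb R^d} \int_{\mathcal D_*} \ell(Av)\,d\hat\mu \le \int \ell(Aw)\,d\hat\mu - \inf_{v \in \mathbb R^d} \int \ell(Av)\,d\hat\mu = \mathcal E(w; \hat\mu) $$

directly, and

$$
\begin{aligned}
\mathcal R(w; \hat\mu_{\mathcal D_*^c}) &= \int \ell(Aw)\,d\hat\mu - \int_{\mathcal D_*} \ell(Aw)\,d\hat\mu \\
&\le \int \ell(Aw)\,d\hat\mu - \inf_{v \in \mathbb R^d} \int_{\mathcal D_*} \ell(Av)\,d\hat\mu \\
&= \int \ell(Aw)\,d\hat\mu - \inf_{v \in \mathbb R^d} \int \ell(Av)\,d\hat\mu = \mathcal E(w; \hat\mu).
\end{aligned}
$$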


J.3. Controlling deviations over $\mathcal D_*^c$

This section will establish the deviation bound over $\mathcal D_*^c$, namely Lemma J.9. Superficially, this is
merely an application of the VC theorem; however, there are two issues under the surface.

First, note that this lemma does not attempt to control $\mathcal E(w; \mu_{\mathcal D_*^c})$, which of course would allow
a direct application of Lemma 3.5 and ostensibly an easy analysis over $\mathcal D_*^c$ within the proof of
Theorem 1.2. The reason is that there is evidence $\mathcal E(w; \mu_{\mathcal D_*^c})$ cannot be controlled without placing
strong restrictions on $w \in \mathbb R^d$ (Levy et al., 2014). On the other hand, the margin-like bound here is
sufficient to aid in the proof of Theorem 1.2.

The second issue is that $\mathcal D_*$ is an object constructed over $\mu$ rather than $\hat\mu$, which is circumvented
via Lemma J.8.


Lemma J.9 Let probability measure $\mu$, hypotheses $\mathcal H$ with $|\mathcal H| = d$, loss $\ell \in \mathbb L$, and canonical
difficult set $\mathcal D_*$ with $\mu(\mathcal D_*^c) > 0$ be given. Then with probability at least $1 - 2\delta$ over an i.i.d. draw
of size $n$ from $\mu$, every $w \in \mathbb R^d$ and $\epsilon > 0$ with $\epsilon \ge \mathcal E_n(w)$ satisfies


{{z G 2)J : £(Aw) > e}) < 


/32(1 + d) ln(l + n) + 41n(l/(5) 
/2(2)5) n 


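Proof Since the set of linear threshold functions with weight vectors in $\mathbb R^d$ has VC dimension $1 + d$,
the nondecreasing property of $\ell$ combined with the VC theorem grants (Boucheron et al., 2005),
with probability $1 - \delta$,

$$
\begin{aligned}
&\sup_{\substack{w \in \mathbb R^d \\ s > 0}} \left| \hat\mu|_{\mathcal D_*^c}(\{z \in \mathcal D_*^c : \ell(Aw) > s\}) - \mu|_{\mathcal D_*^c}(\{z \in \mathcal D_*^c : \ell(Aw) > s\}) \right| \\
&\qquad \le \sup_{\substack{w \in \mathbb R^d \\ r \in \mathbb R}} \left| \hat\mu|_{\mathcal D_*^c}(\{z \in \mathcal D_*^c : (Aw) > r\}) - \mu|_{\mathcal D_*^c}(\{z \in \mathcal D_*^c : (Aw) > r\}) \right| \\
&\qquad \le \sqrt{\frac{16(1 + d)\ln(1 + n) + 2\ln(1/\delta)}{n}}.
\end{aligned}
$$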
Now let $w \in \mathbb R^d$ be arbitrary. Instantiating the above display with $w$ and $s := \epsilon > 0$, and then
applying Lemma 1.1 on measure and set with scalar r := e > 0, it follows that 


/xpc {{z E VI : £{Aw) > e}) < 


+ 


^|Dj) /l6(l + d) ln(l + n) + 21n(l/(5) 


n 


To finish, Lemma J.8 grants 3i(ui;/2|Dc) = 0l{w;'pxi^)/Jl{V1) < £„(in)/^(2)5) after discarding 
another 5 failure probability, and the result follows by plugging in e > w Eniw). ■ 


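J.4. Controlling deviations over $\mathcal D_*$: Proof of Lemma 3.10

In order to establish Lemma 3.10, two lemmas are in order: the first shows that $\mathrm{Bal}(\mu|_{\mathcal D_*})$ is statistically
stable, and the second develops a refined deviation bound over $\mathcal D_*$.

Lemma J.10 Let probability measure $\mu$, hypotheses $\mathcal H$ with $|\mathcal H| = d$, and canonical difficult set
$\mathcal D_*$ with $\mu(\mathcal D_*) > 0$ be given. Then with probability at least $1 - \delta$ over a draw of size $n$ from $\mu|_{\mathcal D_*}$,
$\mathrm{Ker}(\mu|_{\mathcal D_*}) \subseteq \mathrm{Ker}(\hat\mu|_{\mathcal D_*})$, and

$$ \mathrm{Bal}(\hat\mu|_{\mathcal D_*}) \ge \mathrm{Bal}(\mu|_{\mathcal D_*}) - 8\sqrt{\frac{\ln(2d) + \ln(4/\delta)}{n}}. $$

Moreover, if $n \ge 256(\ln(2d) + \ln(4/\delta))/\mathrm{Bal}(\mu|_{\mathcal D_*})^2$, then $\mathrm{Bal}(\hat\mu|_{\mathcal D_*}) \ge \mathrm{Bal}(\mu|_{\mathcal D_*})/2$ and
$\mathrm{Ker}(\mu|_{\mathcal D_*}) = \mathrm{Ker}(\hat\mu|_{\mathcal D_*})$.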
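To see why the sample-size threshold in Lemma J.10 halves the balance constant, the arithmetic can be checked directly. This is a small illustrative sketch with hypothetical function names; it assumes the deviation term has the form $8\sqrt{(\ln(2d)+\ln(4/\delta))/n}$, which is how the proof uses it.

```python
import math

def log_term(d: int, delta: float) -> float:
    """ln(2d) + ln(4/delta), the complexity term shared by both statements."""
    return math.log(2 * d) + math.log(4 / delta)

def deviation(n: int, d: int, delta: float) -> float:
    """Assumed deviation term: 8 * sqrt((ln(2d) + ln(4/delta)) / n)."""
    return 8.0 * math.sqrt(log_term(d, delta) / n)

def min_sample_size(d: int, delta: float, bal: float) -> int:
    """Smallest integer n with n >= 256 (ln(2d) + ln(4/delta)) / bal^2."""
    return math.ceil(256.0 * log_term(d, delta) / bal ** 2)

for d, delta, bal in [(10, 0.05, 0.3), (1000, 0.01, 0.05), (2, 0.5, 1.0)]:
    n = min_sample_size(d, delta, bal)
    # At this n the deviation is at most bal / 2, so Bal(mu_hat) >= bal / 2.
    assert deviation(n, d, delta) <= bal / 2 + 1e-12
```

At the threshold itself the deviation equals exactly $\mathrm{Bal}/2$, since $8\sqrt{1/256} = 1/2$.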


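Proof Let $S := (z_i)_{i=1}^n$ denote the random draw from $\mu|_{\mathcal D_*}$.

First it will be shown that $\mathrm{Ker}(\mu|_{\mathcal D_*}) \subseteq \mathrm{Ker}(\hat\mu|_{\mathcal D_*})$ with probability 1. If $\mathrm{Ker}(\mu|_{\mathcal D_*}) = \{0\}$, then
the claim is immediate, thus suppose $\mathrm{Ker}(\mu|_{\mathcal D_*})$ is a nontrivial subspace of $\mathbb R^d$. Pick an orthonormal
basis $(w_j)_{j=1}^k$ for $\mathrm{Ker}(\mu|_{\mathcal D_*})$. For each $w_j$, define $N_j := \{z \in \mathcal Z : (Aw_j)(z) \ne 0\}$, whereby
$\mu|_{\mathcal D_*}(N_j) = 0$ since $w_j \in \mathrm{Ker}(\mu|_{\mathcal D_*})$. Since $\mu|_{\mathcal D_*}(\cup_j N_j) = 0$, then with probability 1 over the
draw of sample $S$, every $z_i \in S$ and $w_j$ satisfy $(Aw_j)(z_i) = 0$. Consequently, given an arbitrary
$w \in \mathrm{Ker}(\mu|_{\mathcal D_*})$, there exist scalars $(\alpha_j)_{j=1}^k$ such that $w = \sum_{j=1}^k \alpha_j w_j$; thus, for every $z_i \in S$,
by linearity

$$ (Aw)(z_i) = \sum_{j=1}^k \alpha_j (Aw_j)(z_i) = 0, $$

meaning $w \in \mathrm{Ker}(\hat\mu|_{\mathcal D_*})$ as well. Hence, $\mathrm{Ker}(\mu|_{\mathcal D_*}) \subseteq \mathrm{Ker}(\hat\mu|_{\mathcal D_*})$.

Throughout the remainder of this proof, discard the failure event for the above control on
$\mathrm{Ker}(\hat\mu|_{\mathcal D_*})$: in particular, suppose $\mathrm{Ker}(\mu|_{\mathcal D_*}) \subseteq \mathrm{Ker}(\hat\mu|_{\mathcal D_*})$, and equivalently
$\mathrm{Ker}(\mu|_{\mathcal D_*})^\perp \supseteq \mathrm{Ker}(\hat\mu|_{\mathcal D_*})^\perp$.

In order to produce the lower bound on $\mathrm{Bal}(\hat\mu|_{\mathcal D_*})$, first consider the case $\mathrm{Bal}(\mu|_{\mathcal D_*}) = \infty$. This
means $\mathrm{Ker}(\mu|_{\mathcal D_*})^\perp = \{0\}$, which combined with $\mathrm{Ker}(\mu|_{\mathcal D_*})^\perp \supseteq \mathrm{Ker}(\hat\mu|_{\mathcal D_*})^\perp$ means $\mathrm{Bal}(\hat\mu|_{\mathcal D_*}) = \infty$
as well, giving the desired bound.

Now consider the case $\mathrm{Bal}(\mu|_{\mathcal D_*}) < \infty$. First note that the map $z \mapsto \max\{(Aw)(z), 0\}$ is
the composition of the 1-Lipschitz univariate map $\max\{\cdot, 0\}$ together with a linear function, so by
Lemma C.2, it has Rademacher complexity $\|w\|_1\sqrt{2\ln(2d)/n}$ since $|-yh(x)| \le 1$ for all $(x, y)$.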



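Combining this with standard deviation bounds for Rademacher complexity (Lemma C.1), with
probability $1 - \delta$,

$$ \sup_{\|w\|_1 \le 1} \left| \int \max\{Aw, 0\}\,d\mu|_{\mathcal D_*} - \int \max\{Aw, 0\}\,d\hat\mu|_{\mathcal D_*} \right| \le 8\sqrt{\frac{\ln(2d) + \ln(4/\delta)}{n}}. \qquad (18) $$

Combining Eq. (18) with $\mathrm{Ker}(\mu|_{\mathcal D_*})^\perp \supseteq \mathrm{Ker}(\hat\mu|_{\mathcal D_*})^\perp$,

$$
\begin{aligned}
\mathrm{Bal}(\hat\mu|_{\mathcal D_*}) &= \inf\left\{ \int \max\{Aw, 0\}\,d\hat\mu|_{\mathcal D_*} : \|w\|_1 = 1,\ w \in \mathrm{Ker}(\hat\mu|_{\mathcal D_*})^\perp \right\} \\
&\ge \inf\left\{ \int \max\{Aw, 0\}\,d\mu|_{\mathcal D_*} : \|w\|_1 = 1,\ w \in \mathrm{Ker}(\mu|_{\mathcal D_*})^\perp \right\} - 8\sqrt{\frac{\ln(2d) + \ln(4/\delta)}{n}} \\
&= \mathrm{Bal}(\mu|_{\mathcal D_*}) - 8\sqrt{\frac{\ln(2d) + \ln(4/\delta)}{n}}.
\end{aligned}
$$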

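For the last statements, suppose the provided lower bound on $n$; this immediately grants the
bound $\mathrm{Bal}(\hat\mu|_{\mathcal D_*}) \ge \mathrm{Bal}(\mu|_{\mathcal D_*})/2$ by the above derivation. To show $\mathrm{Ker}(\hat\mu|_{\mathcal D_*}) = \mathrm{Ker}(\mu|_{\mathcal D_*})$, it
suffices, by the above, to show $\mathrm{Ker}(\hat\mu|_{\mathcal D_*}) \subseteq \mathrm{Ker}(\mu|_{\mathcal D_*})$. If $\mathrm{Ker}(\mu|_{\mathcal D_*})^\perp = \{0\}$, then
$\mathrm{Ker}(\mu|_{\mathcal D_*})^\perp \subseteq \mathrm{Ker}(\hat\mu|_{\mathcal D_*})^\perp$ immediately since the latter is a subspace, thus suppose $\mathrm{Ker}(\mu|_{\mathcal D_*})^\perp$ is
nontrivial. Let $w \in \mathrm{Ker}(\mu|_{\mathcal D_*})^\perp$ with $\|w\|_1 > 0$ be arbitrary; by the definition of $\mathrm{Bal}(\mu|_{\mathcal D_*})$, the lower
bound on $n$, the deviation bound from Eq. (18), and since $\mathrm{Bal}(\mu|_{\mathcal D_*}) > 0$ because $\mu(\mathcal D_*) > 0$ (see
Proposition J.4),

$$
\begin{aligned}
\int \max\{Aw, 0\}\,d\hat\mu|_{\mathcal D_*} &= \|w\|_1 \int \max\{Aw/\|w\|_1, 0\}\,d\hat\mu|_{\mathcal D_*} \\
&\ge \|w\|_1 \inf\left\{ \int \max\{Aw', 0\}\,d\hat\mu|_{\mathcal D_*} : \|w'\|_1 = 1,\ w' \in \mathrm{Ker}(\mu|_{\mathcal D_*})^\perp \right\} \\
&\ge \|w\|_1 \left( \inf\left\{ \int \max\{Aw', 0\}\,d\mu|_{\mathcal D_*} : \|w'\|_1 = 1,\ w' \in \mathrm{Ker}(\mu|_{\mathcal D_*})^\perp \right\} - 8\sqrt{\frac{\ln(2d) + \ln(4/\delta)}{n}} \right) \\
&= \|w\|_1 \left( \mathrm{Bal}(\mu|_{\mathcal D_*}) - 8\sqrt{\frac{\ln(2d) + \ln(4/\delta)}{n}} \right) \\
&\ge \|w\|_1\,\mathrm{Bal}(\mu|_{\mathcal D_*})/2 > 0.
\end{aligned}
$$

Since $\int \max\{Aw, 0\}\,d\hat\mu|_{\mathcal D_*} > 0$, there must exist $z_i \in S$ with $(Aw)(z_i) > 0$, and in particular
$w \notin \mathrm{Ker}(\hat\mu|_{\mathcal D_*})$. To see how this gives the result, suppose contradictorily that $\mathrm{Ker}(\hat\mu|_{\mathcal D_*}) \not\subseteq \mathrm{Ker}(\mu|_{\mathcal D_*})$,
whereby there must exist $w \in \mathrm{Ker}(\hat\mu|_{\mathcal D_*}) \cap \mathrm{Ker}(\mu|_{\mathcal D_*})^\perp$ with $w \ne 0$. But the above analysis showed
that every $w \in \mathrm{Ker}(\mu|_{\mathcal D_*})^\perp$ with $\|w\|_1 > 0$ has $w \notin \mathrm{Ker}(\hat\mu|_{\mathcal D_*})$, a contradiction. ■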


The next lemma is a refined analysis of deviations over $\mathcal D_*$ under the assumption on $\ell$ stated in the
lemma. In particular, $\mathrm{Bal}(\mu|_{\mathcal D_*})$ will be used to establish strong convexity of $\mathcal R(\cdot; \mu|_{\mathcal D_*})$ around $\bar w$. The core
of the convergence rate argument itself follows almost identically a proof by Shalev-Shwartz et al.
(2008, Theorem 1), with two important differences that necessitated a careful reproof.

• Rather than controlling a function which is strongly convex everywhere thanks to a regularizer,
it is instead only used that $\mathcal R(\cdot; \mu|_{\mathcal D_*})$ is inherently strongly convex around the optimum
without any regularization.

• This strong convexity around the optimum is only established over $\mu|_{\mathcal D_*}$, and in particular not
over $\hat\mu|_{\mathcal D_*}$. Of course, since $\mathrm{Bal}(\mu|_{\mathcal D_*})$ is statistically stable (see Lemma J.10), the same proof
shows that $\mathcal R(\cdot; \hat\mu|_{\mathcal D_*})$ is also strongly convex along $\mathcal D_*$, but it is interesting and pleasant that
the proof works directly without establishing this.


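Lemma J.11 Let probability measure $\mu$ over $\mathcal Z$, hypotheses $\mathcal H$ with $|\mathcal H| < \infty$, loss function $\ell$ in the
relevant loss class, and canonical difficult set $\mathcal D_*$ with $\mu(\mathcal D_*) > 0$ be given. Let a primal-dual optimal
pair $(\bar w, \bar q)$ for Eq. (7) with measure $\mu|_{\mathcal D_*}$ be given with $\bar w \in \mathrm{Ker}(\mu|_{\mathcal D_*})^\perp$. Lastly, let $B \ge \|\bar w\|_1$ be given, and
set $W := \{w \in \mathrm{Ker}(\mu|_{\mathcal D_*})^\perp : \|w\|_1 \le B\}$. The following statements hold.

1. Set $\tau := \inf_{|z| \le B} \ell''(z)$ and $\lambda := \tau\,\mathrm{Bal}(\mu|_{\mathcal D_*})^2$, where $\tau > 0$ by the assumption on $\ell$. Then, for every
$w \in W$,

$$ \mathcal E(w; \mu|_{\mathcal D_*}) \ge \frac{\lambda}{2}\|w - \bar w\|_1^2. \qquad (19) $$

2. Let a draw $S$ from $\mu|_{\mathcal D_*}$ of size $n$ be given. Then, with probability at least $1 - \delta$, every $w \in W$
satisfies

$$ \mathcal E(w; \mu|_{\mathcal D_*}) \le 2\mathcal E(w; \hat\mu|_{\mathcal D_*}) + \frac{1024\,\ell'(2B)^2(\ln(2d) + \ln(4/\delta))}{\lambda n}. $$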

Proof 


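1. To start, applying Taylor's theorem pointwise, every $w \in W$ satisfies

$$
\begin{aligned}
\mathcal E(w; \mu|_{\mathcal D_*}) &= \int \ell(Aw)\,d\mu|_{\mathcal D_*} - \int \ell(A\bar w)\,d\mu|_{\mathcal D_*} \\
&\ge \underbrace{\int \ell'(A\bar w)(Aw - A\bar w)\,d\mu|_{\mathcal D_*}}_{\heartsuit} + \underbrace{\frac{\tau}{2}\int (Aw - A\bar w)^2\,d\mu|_{\mathcal D_*}}_{\triangle}.
\end{aligned}
$$

To manage $\heartsuit$, since $\bar q = \ell'(A\bar w)$ $\mu$-a.e. (by Theorem 2.1), and since $\bar q$ is dual feasible
(whereby $A^\top \bar q = 0$), then

$$ \heartsuit = \int \ell'(A\bar w)(Aw - A\bar w)\,d\mu|_{\mathcal D_*} = \int \bar q\,(A(w - \bar w))\,d\mu|_{\mathcal D_*} = 0. $$

For the second term $\triangle$, by Jensen's inequality and the definition of $\mathrm{Bal}(\mu|_{\mathcal D_*})$,

$$
\begin{aligned}
\triangle &\ge \frac{\tau}{2}\left( \int |A(w - \bar w)|\,d\mu|_{\mathcal D_*} \right)^2 \\
&\ge \frac{\tau}{2}\left( \int \max\{A(w - \bar w), 0\}\,d\mu|_{\mathcal D_*} \right)^2 \\
&\ge \frac{\tau\,\mathrm{Bal}(\mu|_{\mathcal D_*})^2}{2}\|w - \bar w\|_1^2,
\end{aligned}
$$

which gives the bound.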
2. As discussed above, this proof follows one due to Shalev-Shwartz et al. (2008, Proof of
Theorem 1).

To start, when $\nu \in \{\mu|_{\mathcal D_*}, \hat\mu|_{\mathcal D_*}\}$ and $w \in W$, then $\mathcal R(\cdot; \nu)$ is $L := \ell'(2B)$ Lipschitz; additionally,
$\mathcal R(\cdot; \mu|_{\mathcal D_*})$ satisfies Eq. (19), meaning it is $\lambda$-strongly-convex around $\bar w$ as above.

Let $r > 0$ be a constant to be optimized at the end of the proof, and define

$$
\begin{aligned}
k_w &:= \min\left\{ k \in \mathbb Z_+ : \mathcal E(w; \mu|_{\mathcal D_*}) \le r4^k \right\}, \\
f_w(z) &:= \ell((Aw)(z)) - \ell((A\bar w)(z)), \\
g_w(z) &:= 4^{-k_w} f_w(z), \\
\mathcal G &:= \{g_w : w \in W\}.
\end{aligned}
$$

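Applying Lemma C.1 to $\mathcal G$, then with probability at least $1 - \delta$, each $w \in W$ satisfies

$$ \int g_w\,d\mu|_{\mathcal D_*} \le \int g_w\,d\hat\mu|_{\mathcal D_*} + 2\underbrace{\mathfrak R_n(\mathcal G)}_{\clubsuit} + 4\underbrace{\sup_{w \in W, z \in \mathcal Z} |g_w(z)|}_{\diamondsuit}\sqrt{\frac{2\ln(4/\delta)}{n}}. \qquad (20) $$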

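Following the proof scheme of Shalev-Shwartz et al. (2008, Proof of Theorem 1), the two
critical terms are bounded as follows.

• First, $\diamondsuit = \sup_{w \in W, z \in \mathcal Z} |g_w(z)| \le L\sqrt{2r/\lambda}$ as follows. For any $w \in W$ and any $z \in \mathcal Z$,
by the fact that $\mathcal R(\cdot; \mu|_{\mathcal D_*})$ is $L$-Lipschitz and satisfies Eq. (19), since $\mathcal E(w; \mu|_{\mathcal D_*}) \le r4^{k_w}$
by definition of $k_w \ge 0$,

$$
\begin{aligned}
|g_w(z)| &= 4^{-k_w}\left| \ell((Aw)(z)) - \ell((A\bar w)(z)) \right| \\
&\le 4^{-k_w} L\|w - \bar w\|_1 \\
&\le 4^{-k_w} L\sqrt{2\mathcal E(w; \mu|_{\mathcal D_*})/\lambda} \\
&\le 4^{-k_w} L\sqrt{2r4^{k_w}/\lambda} \\
&= 4^{-k_w/2} L\sqrt{2r/\lambda} \\
&\le L\sqrt{2r/\lambda}
\end{aligned}
$$

as desired.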

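• Second, $\clubsuit = \mathfrak R(\mathcal G) \le 4L\sqrt{r\ln(2d)/(\lambda n)}$. For this, first define two helper classes

$$
\begin{aligned}
\mathcal F(a) &:= \{f_w : w \in W,\ \mathcal E(w; \mu|_{\mathcal D_*}) \le a\}, \\
\mathcal F'(a) &:= \{f_w : w \in W,\ \|w - \bar w\|_1 \le \sqrt{2a/\lambda}\}.
\end{aligned}
$$

By Eq. (19), $f_w \in \mathcal F(a)$ implies $\|w - \bar w\|_1 \le \sqrt{2a/\lambda}$, thus $\mathcal F(a) \subseteq \mathcal F'(a)$. By various
properties of Rademacher complexity from Lemma C.2,

$$
\begin{aligned}
\mathfrak R(\mathcal F(a)) &\le \mathfrak R(\mathcal F'(a)) \\
&\le L\,\mathfrak R(\{z \mapsto (Aw)(z) : w \in \mathrm{Ker}(\mu|_{\mathcal D_*})^\perp,\ \|w - \bar w\|_1 \le \sqrt{2a/\lambda}\}) \\
&\le L\,\mathfrak R(\{z \mapsto (Aw)(z) : w \in \mathrm{Ker}(\mu|_{\mathcal D_*})^\perp,\ \|w\|_1 \le \sqrt{2a/\lambda}\}) \\
&\le L\sqrt{\frac{4a\ln(2d)}{\lambda n}}.
\end{aligned}
$$

To control $\mathfrak R(\mathcal G)$, first note that $0 \in \mathcal G$ and $0 \in \mathcal F(a)$ for any $a > 0$, since these sets
all consider the choice $\bar w \in W$. Consequently, Lemma C.2.ii may be applied, which
together with Lemma C.2.i yields

$$ \mathfrak R(\mathcal G) \le \mathfrak R\left( \cup_{k=0}^\infty 4^{-k}\mathcal F(r4^k) \right) \le \sum_{k=0}^\infty 4^{-k}\,\mathfrak R\left( \mathcal F(r4^k) \right). $$

This completes the bound on $\mathfrak R(\mathcal G)$, since the above estimates grant

$$ \sum_{k=0}^\infty 4^{-k}\,\mathfrak R\left( \mathcal F(r4^k) \right) \le L\sqrt{\frac{4r\ln(2d)}{\lambda n}}\sum_{k=0}^\infty 4^{-k/2} = 4L\sqrt{\frac{r\ln(2d)}{\lambda n}}. $$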
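The peeling sum in the second bullet collapses to a geometric series, because the weight $4^{-k}$ meets a complexity bound growing like $\sqrt{4^k}$. The sketch below is an illustration with hypothetical function names; it checks that the partial sums approach 2 and that the resulting constant is $4L\sqrt{r\ln(2d)/(\lambda n)}$.

```python
import math

def partial_sum(K: int) -> float:
    """sum_{k=0}^{K} 4^{-k} * sqrt(4^k) = sum_{k=0}^{K} 2^{-k}."""
    return sum(4.0 ** (-k) * math.sqrt(4.0 ** k) for k in range(K + 1))

def peeled_bound(L: float, r: float, lam: float, n: int, d: int, K: int = 60) -> float:
    """Truncated version of sum_k 4^{-k} * L * sqrt(4 * r * 4^k * ln(2d) / (lam * n))."""
    return sum(
        4.0 ** (-k) * L * math.sqrt(4.0 * r * 4.0 ** k * math.log(2 * d) / (lam * n))
        for k in range(K + 1)
    )

def closed_form(L: float, r: float, lam: float, n: int, d: int) -> float:
    """The stated bound 4 L sqrt(r ln(2d) / (lam n))."""
    return 4.0 * L * math.sqrt(r * math.log(2 * d) / (lam * n))

assert partial_sum(10) == 2 - 2 ** -10          # exact: only powers of two are summed
assert math.isclose(peeled_bound(1.5, 0.2, 0.7, 500, 8), closed_form(1.5, 0.2, 0.7, 500, 8))
```

The parameter values in the last line are arbitrary; any positive choice gives the same agreement.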

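Continuing with the deviation bound in Eq. (20), set $r$ with foresight as

$$ r := 8192L^2\left( \frac{\ln(2d) + \ln(4/\delta)}{\lambda n} \right). $$

Now combining the preceding inequalities on $\diamondsuit$ and $\clubsuit$, the choice of $r$, and the general
inequality $\sqrt a + \sqrt b \le \sqrt{2(a + b)}$ over nonnegative reals, it follows for every $w \in W$ that

$$
\begin{aligned}
\mathcal E(w; \mu|_{\mathcal D_*}) - \mathcal E(w; \hat\mu|_{\mathcal D_*}) &= \mathcal R(w; \mu|_{\mathcal D_*}) - \mathcal R(\bar w; \mu|_{\mathcal D_*}) - \mathcal R(w; \hat\mu|_{\mathcal D_*}) + \inf_{v \in \mathbb R^d} \mathcal R(v; \hat\mu|_{\mathcal D_*}) \\
&\le \mathcal R(w; \mu|_{\mathcal D_*}) - \mathcal R(\bar w; \mu|_{\mathcal D_*}) - \left( \mathcal R(w; \hat\mu|_{\mathcal D_*}) - \mathcal R(\bar w; \hat\mu|_{\mathcal D_*}) \right) \\
&= 4^{k_w}\left( \int g_w\,d\mu|_{\mathcal D_*} - \int g_w\,d\hat\mu|_{\mathcal D_*} \right) \\
&\le 4^{k_w}\left( 8L\sqrt{\frac{r\ln(2d)}{\lambda n}} + 4L\sqrt{\frac{2r}{\lambda}}\sqrt{\frac{2\ln(4/\delta)}{n}} \right) \\
&\le 4^{k_w}\sqrt r \cdot 8L\sqrt 2\sqrt{\frac{\ln(2d) + \ln(4/\delta)}{\lambda n}}.
\end{aligned}
$$

To finish the proof, consider two cases for the value of $k_w$: either $k_w = 0$, or $k_w > 0$. When
$k_w = 0$, then the choice of $r$ gives

$$ \mathcal E(w; \mu|_{\mathcal D_*}) - \mathcal E(w; \hat\mu|_{\mathcal D_*}) \le \frac{r4^0}{8} = \frac{1024L^2(\ln(2d) + \ln(4/\delta))}{\lambda n}, $$

which yields the desired inequality since $L \le \ell'(2B)$ and by adding $\mathcal E(w; \hat\mu|_{\mathcal D_*}) \ge 0$ to the right
hand side. On the other hand, when $k_w > 0$, then the definition of $k_w$ implies $\mathcal E(w; \mu|_{\mathcal D_*}) >
4^{k_w - 1}r$, which plugged back into the above gives

$$ \mathcal E(w; \mu|_{\mathcal D_*}) - \mathcal E(w; \hat\mu|_{\mathcal D_*}) \le \frac{r4^{k_w}}{8} = \frac{r4^{k_w - 1}}{2} < \frac{\mathcal E(w; \mu|_{\mathcal D_*})}{2}, $$

meaning $\mathcal E(w; \mu|_{\mathcal D_*}) \le 2\mathcal E(w; \hat\mu|_{\mathcal D_*})$, giving the desired bound.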
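The choice of $r$ is tuned so that the deviation bound equals $r4^{k_w}/8$. The following sketch checks that arithmetic numerically. It is an illustration only: the function names are hypothetical, and the two critical terms are taken in the forms used in the proof, namely a sup term at most $L\sqrt{2r/\lambda}$ and a Rademacher term at most $4L\sqrt{r\ln(2d)/(\lambda n)}$.

```python
import math

def r_choice(L: float, lam: float, n: int, d: int, delta: float) -> float:
    """r := 8192 L^2 (ln(2d) + ln(4/delta)) / (lam n)."""
    return 8192.0 * L ** 2 * (math.log(2 * d) + math.log(4 / delta)) / (lam * n)

def deviation_bound(k: int, L: float, lam: float, n: int, d: int, delta: float) -> float:
    """4^k * (2 * Rademacher term + 4 * sup term * sqrt(2 ln(4/delta) / n))."""
    r = r_choice(L, lam, n, d, delta)
    rademacher = 4.0 * L * math.sqrt(r * math.log(2 * d) / (lam * n))
    sup_term = L * math.sqrt(2.0 * r / lam)
    return 4.0 ** k * (2.0 * rademacher + 4.0 * sup_term * math.sqrt(2.0 * math.log(4 / delta) / n))

def target(k: int, L: float, lam: float, n: int, d: int, delta: float) -> float:
    """The value r 4^k / 8, which equals 4^k * 1024 L^2 (ln(2d) + ln(4/delta)) / (lam n)."""
    return 4.0 ** k * r_choice(L, lam, n, d, delta) / 8.0

for k in (0, 1, 3):
    for (L, lam, n, d, delta) in [(1.0, 0.5, 1000, 10, 0.05), (3.0, 0.01, 50, 200, 0.2)]:
        # sqrt(a) + sqrt(b) <= sqrt(2 (a + b)) is what makes this inequality hold.
        assert deviation_bound(k, L, lam, n, d, delta) <= target(k, L, lam, n, d, delta) * (1 + 1e-12)
```

Equality would require $\ln(2d) = \ln(4/\delta)$; otherwise the bound is strict, which is harmless for the argument.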


Proof (of Lemma 3.10) This proof will be focused on parts (ii) and (iii); to start, note the following 
two supporting results, the second of which will establish part (i) of the desired statement along the 
way. 


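• First note how $\inf_{v \in \mathbb R^d} \mathcal R(v; \nu)$ can be related for $\nu \in \{\mu|_{\mathcal D_*}, \hat\mu|_{\mathcal D_*}\}$. By Proposition J.4 and
the assumption $\mu(\mathcal D_*) > 0$, $\mathrm{Bal}(\mu|_{\mathcal D_*}) > 0$, and thus Lemma J.5 gives a primal optimum
$\bar w \in \mathrm{Ker}(\mu|_{\mathcal D_*})^\perp$ with $\|\bar w\|_1 \le \ell(0)/(s\,\mathrm{Bal}(\mu|_{\mathcal D_*}))$. Consequently, by Hoeffding's inequality
applied to a random variable with range $\ell(\|\bar w\|_1)$, with probability at least $1 - \delta$,

$$
\begin{aligned}
\inf_{v \in \mathbb R^d} \mathcal R(v; \hat\mu|_{\mathcal D_*}) \le \mathcal R(\bar w; \hat\mu|_{\mathcal D_*}) &\le \mathcal R(\bar w; \mu|_{\mathcal D_*}) + \ell(\|\bar w\|_1)\sqrt{\frac{2\ln(1/\delta)}{n}} \\
&= \inf_{v \in \mathbb R^d} \mathcal R(v; \mu|_{\mathcal D_*}) + \ell(\|\bar w\|_1)\sqrt{\frac{2\ln(1/\delta)}{n}},
\end{aligned}
$$

which will be useful via the rearrangement

$$ -\inf_{v \in \mathbb R^d} \mathcal R(v; \mu|_{\mathcal D_*}) \le -\inf_{v \in \mathbb R^d} \mathcal R(v; \hat\mu|_{\mathcal D_*}) + \ell(\|\bar w\|_1)\sqrt{\frac{2\ln(1/\delta)}{n}}. \qquad (21) $$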

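• Secondly, assume the final consequence of Lemma J.10 holds, discarding along the way another
failure event having probability at most $\delta$: by the lower bound on $n$, $\mathrm{Ker}(\mu|_{\mathcal D_*}) =
\mathrm{Ker}(\hat\mu|_{\mathcal D_*})$ and

$$ 2\,\mathrm{Bal}(\hat\mu|_{\mathcal D_*}) \ge \mathrm{Bal}(\mu|_{\mathcal D_*}). \qquad (22) $$

To see the value of $\mathrm{Ker}(\mu|_{\mathcal D_*}) = \mathrm{Ker}(\hat\mu|_{\mathcal D_*})$, given any $w \in \mathbb R^d$, henceforth write $w =
w_0 \oplus w_\perp$ with $w_0 \in \mathrm{Ker}(\mu|_{\mathcal D_*})$ and $w_\perp \in \mathrm{Ker}(\mu|_{\mathcal D_*})^\perp$, where additionally $w_0 \in \mathrm{Ker}(\hat\mu|_{\mathcal D_*})$
and $w_\perp \in \mathrm{Ker}(\hat\mu|_{\mathcal D_*})^\perp$. As a first consequence,

$$ (Aw)(z) = (Aw_0)(z) + (Aw_\perp)(z) = (Aw_\perp)(z) \quad \text{for $\mu$-a.e. and $\hat\mu$-a.e. } z \in \mathcal D_*, \qquad (23) $$

which further implies

$$ \mathcal R(w_\perp; \nu) = \mathcal R(w; \nu) \quad\text{and}\quad \mathcal E(w_\perp; \nu) = \mathcal E(w; \nu) \quad \text{for } \nu \in \{\mu|_{\mathcal D_*}, \hat\mu|_{\mathcal D_*}\}. \qquad (24) $$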
Secondly, by Lemma J.1 applied to measure $\hat\mu|_{\mathcal D_*}$ (where Lemma J.1 requires $w_\perp \in \mathrm{Ker}(\hat\mu|_{\mathcal D_*})^\perp$),
and also using Eq. (22), Eq. (24), and the form of $B_w$,

\\w±\\i < ^ ^ (25) 

sBal(7r|2)^) sBal(/r|2)^) 

This last inequality is essential as it allows ||in_L||i and to be related, the latter 

being a purely sample-dependent quantity. In particular, part (i) follows immediately by
combining Eq. (23) and Eq. (25); that is, for $\mu$-a.e. $z \in \mathcal D_*$ and $\hat\mu$-a.e. $z \in \mathcal D_*$, $|(Aw)(z)| =
|(Aw_\perp)(z)| \le \|w_\perp\|_1 \le B_w$.

The remainder of the proof will establish parts (ii) and (hi) by organizing into sets (Wj)j>i 
with = Uj>i Wj. Eor every integer i > 1 define 


Ri :=i + ||u;||i, 

Wj := |r(; G , 

6i ■.= 6/{i + lf. 


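By this choice, $\sum_{i \ge 1} \delta_i \le \delta$, thus proving both types of bound contributes to the final $2\delta$ in the
full statement. Secondly, note how Eq. (25) gives a way to use ^\d^) to choose i with w G Wj: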
full statement. Secondly, note how Eq. (25) gives a way to use ^\d^) to choose i with w G Wj: 
the largest i granting m G Wj satisfies f < 1 — Umlli + HictIIi < B^ — 1 — ||m||i. 
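The union bound over the sets $W_i$ needs the per-set failure probabilities to sum to at most $\delta$. The sketch below illustrates this under an explicit assumption: it reads the garbled definition as $\delta_i := \delta/(i+1)^2$, which is not certain from the extracted text. The function names are hypothetical.

```python
import math

def delta_i(delta: float, i: int) -> float:
    """Assumed per-set failure budget: delta_i := delta / (i + 1)^2 for i >= 1."""
    return delta / (i + 1) ** 2

def total_budget(delta: float, m: int) -> float:
    """Failure probability spent on the first m sets W_1, ..., W_m."""
    return sum(delta_i(delta, i) for i in range(1, m + 1))

delta = 0.1
# sum_{i>=1} 1/(i+1)^2 = pi^2/6 - 1 < 1, so the total never exceeds delta.
limit = delta * (math.pi ** 2 / 6 - 1)
for m in (1, 10, 1000, 100_000):
    assert total_budget(delta, m) < limit < delta
```

Under this reading only about 64.5% of $\delta$ is spent, so the bound $\sum_i \delta_i \le \delta$ holds with room to spare.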

With this structure in place, parts (ii) and (iii) are established for each i G Z++ separately as 
follows, and the general bounds follow by replacing the term Ri via 

Ri<i + llmlli < - 1 - ||m||i^ + Umlli = 5^-1. 

Note that, restricted to $W_i$, $\ell$ satisfies $\mu$-a.e. boundedness in the sense that $\sup_{w \in W_i} \ell((Aw)(z)) \le
\ell(R_i)$ for $\mu$-a.e. $z \in \mathcal D_*$ and also $\hat\mu$-a.e. $z \in \mathcal D_*$ (by part (i)), and $\ell$ is Lipschitz with constant $\ell'(R_i)$.


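(ii) Using Rademacher complexity of Lipschitz functions (Lemmas C.1 and C.2), and Eq. (24)
to swap $w$ and $w_\perp$, and noting the general inequality $\sqrt a + \sqrt b \le \sqrt{2a + 2b}$ for nonnegative
reals, then for any fixed $i$, with probability at least $1 - \delta_i$, every $w \in W_i$ satisfies

$$
\begin{aligned}
\mathcal R(w; \mu|_{\mathcal D_*}) - \mathcal R(w; \hat\mu|_{\mathcal D_*}) &\le 2\ell'(R_i)R_i\sqrt{2\ln(2d)/n} + 4\ell(R_i)\sqrt{2\ln(4/\delta_i)/n} \\
&\le 8\left( \ell'(R_i)R_i + \ell(R_i) \right)\sqrt{\frac{\ln(2d) + \ln(4/\delta_i)}{n}} \\
&\le 8\ell(2R_i)\sqrt{\frac{\ln(2d) + \ln(4/\delta_i)}{n}},
\end{aligned}
$$

where the last simplification used $\ell(R_i) + \ell'(R_i)(2R_i - R_i) \le \ell(2R_i)$ by convexity. To finish
the proof, combining the above display with Eq. (21) gives

$$
\begin{aligned}
\mathcal E(w; \mu|_{\mathcal D_*}) - \mathcal E(w; \hat\mu|_{\mathcal D_*}) &= \mathcal R(w; \mu|_{\mathcal D_*}) - \mathcal R(w; \hat\mu|_{\mathcal D_*}) - \inf_{v \in \mathbb R^d} \mathcal R(v; \mu|_{\mathcal D_*}) + \inf_{v \in \mathbb R^d} \mathcal R(v; \hat\mu|_{\mathcal D_*}) \\
&\le \mathcal R(w; \mu|_{\mathcal D_*}) - \mathcal R(w; \hat\mu|_{\mathcal D_*}) + \ell(\|\bar w\|_1)\sqrt{\frac{2\ln(1/\delta)}{n}} \\
&\le 10\ell(2R_i)\sqrt{\frac{\ln(2d) + \ln(4/\delta_i)}{n}}.
\end{aligned}
$$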
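(iii) Similarly to the purely Lipschitz case above, but now using Lemma J.11 to control deviations,
with probability at least $1 - \delta_i$, each $w \in W_i$ satisfies

$$
\begin{aligned}
\mathcal E(w; \mu|_{\mathcal D_*}) &\le 2\mathcal E(w_\perp; \hat\mu|_{\mathcal D_*}) + \frac{1024\,\ell'(2R_i)^2(\ln(2d) + \ln(4/\delta_i))}{n\,\tau(R_i)\,\mathrm{Bal}(\mu|_{\mathcal D_*})^2} \\
&= 2\mathcal E(w; \hat\mu|_{\mathcal D_*}) + \frac{1024\,\ell'(2R_i)^2(\ln(2d) + \ln(4/\delta_i))}{n\,\tau(R_i)\,\mathrm{Bal}(\mu|_{\mathcal D_*})^2}.
\end{aligned}
$$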
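Two elementary facts carry the simplifications in part (ii): the tangent-line inequality $\ell(R) + \ell'(R)R \le \ell(2R)$ for convex $\ell$, and the merging of the two square-root terms into one with constant 8. The sketch below checks both numerically for the exponential and logistic losses; these two losses and the function names are illustrative choices, not a statement about which losses the lemma covers.

```python
import math

def exp_loss(z: float) -> float:
    return math.exp(z)

def exp_grad(z: float) -> float:
    return math.exp(z)

def logistic_loss(z: float) -> float:
    return math.log1p(math.exp(z))

def logistic_grad(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def tangent_gap(loss, grad, R: float) -> float:
    """loss(2R) - (loss(R) + grad(R) * R); nonnegative for convex losses."""
    return loss(2 * R) - (loss(R) + grad(R) * R)

def merged_gap(a: float, b: float, x: float, y: float) -> float:
    """8 (a + b) sqrt(x + y) - (2 a sqrt(2x) + 4 b sqrt(2y)); nonnegative for a, b, x, y >= 0."""
    return 8 * (a + b) * math.sqrt(x + y) - (2 * a * math.sqrt(2 * x) + 4 * b * math.sqrt(2 * y))

for R in (0.0, 0.5, 1.0, 3.0, 10.0):
    assert tangent_gap(exp_loss, exp_grad, R) >= 0
    assert tangent_gap(logistic_loss, logistic_grad, R) >= 0

for a, b, x, y in [(1, 1, 1, 1), (0.1, 5, 2, 0.3), (7, 0, 0, 4)]:
    assert merged_gap(a, b, x, y) >= 0
```

The tangent-line check is just convexity evaluated at the point $2R$ with the tangent taken at $R$.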


J.5. Proof of Theorem 1.2 

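Before proving Theorem 1.2, note briefly how samples drawn from $\mu$ can be treated as a draw from
$\mu|_{\mathcal D_*}$ and $\mu|_{\mathcal D_*^c}$.

Lemma J.12 Let probability measure $\mu$ and a canonical difficult set $\mathcal D_*$ be given. Let $S$ denote a
draw from $\mu$ of size $n \ge 8\ln(1/\delta)$, and define $S_{\mathcal D} := S \cap \mathcal D_*$ and $S_c := S \cap \mathcal D_*^c$ with sizes $n_{\mathcal D} := |S_{\mathcal D}|$
and $n_c := |S_c|$. Then with probability at least $1 - \delta$ over the draw of $S$,

$$ n_{\mathcal D} \ge n\mu(\mathcal D_*)/2, \qquad n_c \ge n\mu(\mathcal D_*^c)/2, $$

and $S_{\mathcal D}$ and $S_c$ can be treated as draws of size $n_{\mathcal D}$ and $n_c$ from $\mu|_{\mathcal D_*}$, $\mu|_{\mathcal D_*^c}$ respectively.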

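Proof Treating the partitioned sample as two independent draws is the usual rejection sampling.
Moreover, by multiplicative Chernoff bounds (Kearns and Vazirani, 1994, Theorem 9.2) and the
lower bound on $n$,

$$
\begin{aligned}
n_{\mathcal D}/n = \hat\mu(\mathcal D_*) &\ge \mu(\mathcal D_*)\left( 1 - \sqrt{2\ln(1/\delta)/n} \right) \ge \mu(\mathcal D_*)/2, \\
n_c/n = \hat\mu(\mathcal D_*^c) &\ge \mu(\mathcal D_*^c)\left( 1 - \sqrt{2\ln(1/\delta)/n} \right) \ge \mu(\mathcal D_*^c)/2.
\end{aligned}
$$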
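The role of the requirement $n \ge 8\ln(1/\delta)$ is purely arithmetic: it forces the shrinkage factor $1 - \sqrt{2\ln(1/\delta)/n}$ to be at least $1/2$. A minimal check, with hypothetical function names:

```python
import math

def min_n(delta: float) -> int:
    """Smallest integer n with n >= 8 ln(1/delta)."""
    return math.ceil(8.0 * math.log(1.0 / delta))

def shrink_factor(n: int, delta: float) -> float:
    """The factor 1 - sqrt(2 ln(1/delta) / n) appearing in both displays."""
    return 1.0 - math.sqrt(2.0 * math.log(1.0 / delta) / n)

for delta in (0.5, 0.1, 0.05, 1e-3, 1e-9):
    n = min_n(delta)
    assert shrink_factor(n, delta) >= 0.5 - 1e-12
    # Larger samples only improve the factor.
    assert shrink_factor(10 * n, delta) >= shrink_factor(n, delta)
```

For example, $\delta = 0.05$ requires $n \ge 24$ under this rule.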


All the pieces are in place to prove Theorem 1.2. 

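Proof (of Theorem 1.2) To prove the bound, set $\delta' := \delta/7$, and let a sample $S$ be given with size

$$ n \ge 8\ln(1/\delta') + \mathbb{1}[\mu(\mathcal D_*) > 0]\left( \frac{512(\ln(2d) + \ln(4/\delta'))}{\mu(\mathcal D_*)\,\mathrm{Bal}(\mu|_{\mathcal D_*})^2} \right) = \mathcal O(\ln(1/\delta')). $$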


By Lemma J.12, conditioning away a first failure probability of $\delta'$, $\mu(\mathcal D_*) > 0$ implies the set
$S_{\mathcal D} = S \cap \mathcal D_*$ is an i.i.d. draw from $\mu|_{\mathcal D_*}$ of size $n_{\mathcal D} = \Omega(n)$ satisfying moreover

$$ n_{\mathcal D} \ge \frac{256(\ln(2d) + \ln(4/\delta'))}{\mathrm{Bal}(\mu|_{\mathcal D_*})^2}, \qquad (26) $$

whereas $\mu(\mathcal D_*^c) > 0$ implies the set $S_c = S \cap \mathcal D_*^c$ is an i.i.d. draw from $\mu|_{\mathcal D_*^c}$ of size $n_c = \Omega(n)$.

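Let $w \in \mathbb R^d$ be arbitrary, and note

$$ \int |\eta_w - \bar\eta|\,d\mu = \underbrace{\int_{\mathcal D_*} |\eta_w - \bar\eta|\,d\mu}_{\heartsuit} + \underbrace{\int_{\mathcal D_*^c} |\eta_w - \bar\eta|\,d\mu}_{\triangle}; \qquad (27) $$

the proof will proceed by controlling $\heartsuit$ and $\triangle$ separately, where either term is 0 automatically if
either $\mu(\mathcal D_*) = 0$ or $\mu(\mathcal D_*^c) = 0$, respectively. Note, throughout this proof, that $\mathcal D = \mathcal D_*$ $\mu$-a.e.
by the assumption on $\ell$, thanks to Proposition 3.9.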

First consider the term $\heartsuit$ (when $\mu(\mathcal D_*) > 0$); the goal will be to invoke Lemma 3.6, however
many of the messy terms therein will be controlled via Lemma 3.10. In particular, assume the
various parts of Lemma 3.10, and condition away an additional $4\delta'$ failure probability, noting that $n_{\mathcal D}$
is sufficiently large by Eq. (26). Let $B_w$ be as in the statement of Lemma 3.10, which crucially
depends on $w$ only through $\mathcal E_n(w)$, and satisfies $|(Aw)(z)| \le B_w$ for $\mu$-a.e. $z \in \mathcal D_*$. Furthermore,
since $\mathrm{Bal}(\mu|_{\mathcal D_*}) > 0$ by $\mu(\mathcal D_*) > 0$ and Lemma J.3, by Lemma J.5 there exists a primal optimum
$\bar w$ with $\bar q_{\mathcal D_*} \in \partial\ell(A\bar w)$ $\mu$-a.e. over $\mathcal D_*$, and $\mu$-a.e. $z \in \mathcal D_*$ satisfies


|(zlt(;)(2;)| < ||ru||i < 


f(0)Bal(/r|Dj 


< Bw 


By Lemma 3.2, $\bar q(z) := \bar q_{\mathcal D_*}(z)\mathbb{1}[z \in \mathcal D_*]$ is also optimal for the full problem over $\mu$; but Theorem 2.1
provided that the full dual optimum is $\mu$-a.e. unique, meaning the general $\bar q$ and this
specialized $\bar q$ agree $\mu$-a.e. over $\mathcal D_*$, and in particular $\mu$-a.e. $z \in \mathcal D_*$ satisfies


£'{-B^)<£'{{Aw){z)) = q{z) = l'{{Aw){z))<£'{BA. 


Consequently, applying Lemma 3.6 with constants $c_1 := B_w$, $c_2 := \inf_{|r| \le B_w} \ell'(r) = \ell'(-B_w)$,
and $c_3 := \sup_{|r| \le B_w} \ell'(r) = \ell'(B_w)$, we have

$$ \mu(S_-) = \mu(S_+) = \mu(V) = 0, $$

so it suffices to include the term for $U$. Now using Lemma 3.10.iii to relate $\mathcal E(w; \mu|_{\mathcal D_*})$ and
$\mathcal E(w; \hat\mu|_{\mathcal D_*})$, additionally the general inequality $\sqrt{a + b} \le \sqrt a + \sqrt b$ for nonnegative reals, and
lastly recalling the notation $\tau(B_w) := \inf_{|q| \le B_w} \ell''(q)$ from Lemma 3.10,




< La 


12S.{w, 

( 

t{BA I 


2£n(m,/r|2)^) + 


'I024f (2,B^)2(ln(2d) + ln(452/5')) 


nq,B3.\{fl\q,A'^T{BA 


<0 /2 £n(ri') V ^ niw ) + 


MV^Y 

n I 


where the term $f_2(\mathcal E_n(w))$ collects all terms depending on $B_w$, which itself depends on $w$ only
through $\mathcal E_n(w)$ as per Lemma 3.10.
Now consider the term $\triangle$ (when $\mu(\mathcal D_*^c) > 0$); the goal will be to invoke Lemma 3.5, however

once again some terms in the bound will be handled manually via Lemma J.9. Set e := r := 

\Jtn{w), and define Sr := {z G DJ : £[{Aw){z)) > r} exactly as in Lemma 3.5, and which also 
appears in Lemma J.9; applying Lemma J.9 with this e to m (where e > 0 since £(m; d/rD*) > 0 by 
> 0 and the assumed lower bound on Uc and since i G and discarding an additional 25' 

failure probability along the way, fi{Sr) < 0 (In(nc) + /5')) /. Combining 

this bound on ^{Sr) with the bound on |? 7 ^ — fi\ from Lemma 3.5 (which uses the fact that D = 
/r-a.e. since I G gives 


A </r(S’^) + r/i(!D=\5^)max|^,^| = 0 

Plugging these bounds on $\triangle$ and $\heartsuit$ back into Eq. (27) gives the desired inequality.

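Lastly, the convergence statement is, as usual, a consequence of the Borel-Cantelli lemma. In
particular, let $\epsilon > 0$ be arbitrary, set $\delta_n := 1/n^2$, and define the event

$$ E_{n,\epsilon} := \left[ \int |\eta_{w_n} - \bar\eta|\,d\mu > \epsilon \right]. $$

Applying the bound above for each $w_n$, as $n \to \infty$ and $\mathcal E_n(w_n) \to 0$, we obtain that there exists
some $N$ so that every $n \ge N$ has $\Pr(E_{n,\epsilon}) \le \delta_n$. Consequently,

$$ \sum_{n \ge 1} \Pr(E_{n,\epsilon}) \le \sum_{n=1}^{N} \Pr(E_{n,\epsilon}) + \sum_{n > N} \delta_n < \infty, $$

and the result follows by applying the Borel-Cantelli lemma. ■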
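The Borel-Cantelli step only needs the failure probabilities $\delta_n = 1/n^2$ to be summable. The sketch below, with hypothetical function names, checks the two standard facts behind this: the tail $\sum_{n > N} 1/n^2$ is at most $1/N$ (by comparison with $\int_N^\infty x^{-2}\,dx$), and the full series stays below $\pi^2/6$.

```python
import math

def tail_sum(N: int, M: int) -> float:
    """sum_{n=N+1}^{M} 1/n^2, a truncation of the tail after index N."""
    return sum(1.0 / n ** 2 for n in range(N + 1, M + 1))

def summable_bound(N: int) -> float:
    """Upper bound N + 1/N on the whole series: the first N terms are
    probabilities (each at most 1) and the tail is at most 1/N."""
    return N + 1.0 / N

for N in (1, 10, 100):
    assert tail_sum(N, 200_000) <= 1.0 / N

# The full series sum_{n>=1} 1/n^2 converges to pi^2/6, so every partial sum is below it.
assert tail_sum(0, 200_000) < math.pi ** 2 / 6
```

Any summable sequence $\delta_n$ would serve equally well here; $1/n^2$ is simply a convenient choice.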